The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem that pretrained large language models often fail to follow instructions under standard decoding, so benchmark scores conflate true model capability with decoding-induced failures. To resolve this, the authors propose Energy-Based Decoding (EBD), a training-free, inference-time framework that uses a lightweight external reward model to guide a frozen pretrained model toward high-utility, task-oriented outputs. EBD is the first method to substantially unlock a model’s intrinsic task-solving abilities without parameter updates, enabling fairer capability assessment across both open-ended and objective tasks. Experiments show consistent gains over existing approaches across five models and six benchmarks: for instance, Qwen3-8B-Base improves from 8.8 to 44.5 on AlpacaEval 2.0, and Mistral-7B latency on Math500 drops by 18.9× relative to prior decoding work, with results remaining robust to reward-model size.

📝 Abstract

With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token prediction and often fail to follow instructions or produce well-formed answers under standard prompting and direct decoding. As a result, benchmark performance can conflate model capability with decoding-induced failures to produce task-oriented outputs, while exposing such behavior often relies on costly post-training. Recent decoding-only approaches attempt to reshape output distributions, but such methods can be inefficient and brittle across open-ended tasks. To address these limitations, we propose Energy-Based Decoding (EBD), a training-free, reward-guided framework for activating task-oriented behaviors from frozen pre-trained LLMs across both open-ended and objective tasks. EBD augments decoding with an external lightweight reward model, steering generations toward high-utility responses while anchoring them to the pre-trained model prior through a reward-tilted target distribution. We show that EBD shifts base-model outputs toward more instruction-following behavior, increasing behavioral similarity to post-trained counterparts and enabling a fairer inference-time evaluation of accessible pre-trained-model behavior. Empirically, EBD outperforms baselines across five models and six benchmarks, improving Qwen3-8B-Base on AlpacaEval 2.0 from 8.8 to 44.5, reducing Mistral-7B Math500 latency by 18.9x relative to prior decoding work, and remaining robust to reward-model size.
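
The abstract does not spell out the form of the reward-tilted target distribution. A common form in the reward-guided decoding literature, assumed here purely for illustration, is p*(y | x) ∝ p_base(y | x) · exp(r(x, y) / β), where r is the reward-model score and β is a temperature controlling how strongly the reward overrides the base prior. Under that assumption, if candidates are sampled from the frozen base model, resampling them with weights proportional to exp(r / β) approximates a draw from p*. The sketch below shows only this generic importance-resampling idea, not the EBD algorithm itself; the function names, the `beta` parameter, and the example rewards are all hypothetical.

```python
import math
import random


def tilted_weights(rewards, beta=1.0):
    """Self-normalized weights proportional to exp(reward / beta).

    Assumes the candidates were sampled from the base model, so the
    base-model probability cancels out of the importance weights.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in rewards]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def resample(candidates, rewards, beta=1.0, rng=None):
    """Pick one candidate according to the tilted weights."""
    rng = rng or random.Random(0)
    weights = tilted_weights(rewards, beta)
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical candidates and reward-model scores.
    candidates = ["answer A", "answer B", "answer C"]
    rewards = [0.2, 1.5, -0.3]
    print(tilted_weights(rewards, beta=0.5))
    print(resample(candidates, rewards, beta=0.5))
```

In this sketch a small `beta` concentrates the weights on the highest-reward candidate (approaching best-of-n selection), while a large `beta` leaves the choice close to the base model's own sampling.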

Problem

Research questions and friction points this paper is trying to address.

pre-trained model evaluation

task-oriented behavior

reward-guided decoding

instruction following

decoding-induced failures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Energy-Based Decoding

reward-guided decoding

pre-trained LLM evaluation