🤖 AI Summary
This work addresses a limitation of conventional large language model decoding: it relies on fixed sampling strategies that do not adapt to how difficulty varies across prompts and decoding steps. The authors propose lightweight decoding adapters that select a sampling strategy (such as greedy, top-k, or min-p) at inference time, conditioned on the available compute budget. Sequence-level decoding is formulated as a contextual bandit problem in which a policy picks one strategy per prompt from the prompt embedding and a parallel sampling budget; token-level decoding is modeled as a partially observable Markov decision process in which a policy picks a sampling action at each step from internal model features and the remaining token budget. Both are trained via reinforcement learning with verifiable terminal rewards, without fine-tuning the language model. On the MATH and CodeContests benchmarks, the adapters improve the accuracy-budget trade-off: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, and the sequence-level adapter yields 2-3% gains under fixed parallel sampling.
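For readers unfamiliar with the sampling strategies named above, the following is a minimal sketch of greedy, top-k, and min-p as filters over a next-token probability distribution. It uses the standard definitions of these strategies, not code from the paper; the function names and the `p_base` parameter name are chosen here for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Put all probability mass on the single most probable token."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def top_k(probs, k):
    """Keep the k most probable tokens and renormalize."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def min_p(probs, p_base):
    """Keep tokens with probability >= p_base * max(probs), then renormalize."""
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def sample(probs, rng):
    """Draw one token index from the (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

An adaptive decoder in the sense described here would choose among such filters (and their parameters) per prompt or per token, instead of fixing one in advance.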
📝 Abstract
Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g., correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g., greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2-3% gains under fixed parallel sampling. Ablation analyses support the contribution of both sequence- and token-level adaptation.
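The sequence-level formulation can be illustrated with a toy contextual bandit. The sketch below is not the authors' implementation: it assumes a linear softmax policy trained with a REINFORCE-style update and a running-mean baseline, a three-feature context (bias, a stand-in "difficulty" feature in place of a real prompt embedding, and a normalized budget), and a simulated reward in place of a real correctness check. All names and the reward rule are hypothetical.

```python
import math
import random

STRATEGIES = ["greedy", "top_k", "min_p"]  # hypothetical action set

def policy_probs(weights, context):
    """Linear softmax policy: one weight row per strategy."""
    logits = [sum(w * x for w, x in zip(row, context)) for row in weights]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_reward(action, hard, rng):
    """Simulated terminal reward standing in for answer correctness.

    Assumption for illustration only: greedy is best on easy prompts,
    min_p is best on hard prompts.
    """
    best = 2 if hard else 0
    success_prob = 0.9 if action == best else 0.2
    return 1.0 if rng.random() < success_prob else 0.0

def train(steps=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    dim = 3
    weights = [[0.0] * dim for _ in STRATEGIES]
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for t in range(1, steps + 1):
        hard = rng.random() < 0.5
        budget = rng.choice([0.25, 1.0])  # normalized sampling budget
        context = [1.0, 1.0 if hard else -1.0, budget]
        probs = policy_probs(weights, context)
        action = rng.choices(range(len(STRATEGIES)), weights=probs, k=1)[0]
        reward = toy_reward(action, hard, rng)
        advantage = reward - baseline
        baseline += (reward - baseline) / t
        # Gradient of log pi(action | context) w.r.t. each weight row.
        for a in range(len(STRATEGIES)):
            grad_logit = (1.0 if a == action else 0.0) - probs[a]
            for j in range(dim):
                weights[a][j] += lr * advantage * grad_logit * context[j]
    return weights

def choose(weights, context):
    """Pick the most probable strategy for a given context."""
    probs = policy_probs(weights, context)
    return STRATEGIES[max(range(len(probs)), key=probs.__getitem__)]
```

In this toy setup, calling `choose(train(), [1.0, 1.0, 1.0])` asks the trained policy which strategy it prefers for a hard prompt at full budget. Each prompt is a single decision with a single terminal reward, which is what makes the bandit framing fit the sequence level; the token-level setting described in the abstract requires a sequential policy and is not covered by this sketch.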