🤖 AI Summary
This work identifies a fundamental flaw in existing real-time reward-guided text generation (RGTG) methods that rely on full-sequence reward models: their scores are not meaningful for the partial sequences encountered during decoding and lack theoretical grounding. To address this, the authors first formalize the misalignment between full-sequence reward modeling and autoregressive token-level decoding. They then propose a novel, Bradley–Terry-based ordinal modeling framework for segment-level reward learning, which explicitly captures preferences over subsequences and yields an analytically tractable token-level policy expression. Crucially, the approach requires no fine-tuning of large language models; instead, it achieves efficient reward guidance via lightweight reward modeling and standard autoregressive sampling. Experiments across multiple benchmarks demonstrate that the method significantly outperforms prior RGTG approaches and matches the performance of strong offline fine-tuning baselines.
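The segment-level Bradley–Terry objective described above can be illustrated with a minimal sketch. The pairwise loss for a preferred vs. dispreferred partial sequence is the standard Bradley–Terry negative log-likelihood, `-log σ(r_w - r_l)`, applied to reward scores of *prefixes* rather than full sequences. The reward values below are hypothetical toy numbers, not from the paper:

```python
import numpy as np

def bt_loss(r_preferred, r_dispreferred):
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_w - r_l), computed stably as log(1 + exp(-(r_w - r_l))).
    The loss is small when the preferred segment scores higher."""
    margin = np.asarray(r_preferred, dtype=float) - np.asarray(r_dispreferred, dtype=float)
    return np.logaddexp(0.0, -margin)

# Toy pairs of (reward of preferred prefix, reward of dispreferred prefix).
# In the paper's setting these rewards would come from a learned
# segment-level reward model scoring two partial continuations.
pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.5)]
losses = [bt_loss(rw, rl) for rw, rl in pairs]
```

A clear margin (first pair) gives a small loss, a tie gives `log 2`, and a wrong ranking (third pair) is penalized heavily; training the reward model on such prefix pairs is what makes its scores well-defined for partial sequences.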
📝 Abstract
Large language models (LLMs) can be improved by aligning them with human preferences through fine-tuning -- so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Because they bypass LLM fine-tuning, prediction-time tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during decoding, steering generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this issue, we propose to train a Bradley-Terry reward model explicitly on partial sequences, and to autoregressively sample from the implied tokenwise policy at decoding time. We study the properties of this reward model and the resulting policy: we show that this policy is proportional to the ratio of two distinct RLHF policies. Our simple approach outperforms previous RGTG methods and performs similarly to strong offline baselines without large-scale LLM fine-tuning.