🤖 AI Summary
To align large language models (LLMs) with human values efficiently at inference time, this paper proposes a decoding-time alignment algorithm. It partitions each generated output into fixed-length token segments, performs fine-grained scoring and rejection sampling per segment, and dynamically accepts or resamples segments based on real-time reward-model feedback, enabling early error correction and online regulation of the generation path. Unlike costly supervised fine-tuning (SFT) or computationally intractable full-sequence ranking, the approach achieves high alignment quality without parameter updates or exhaustive search. Experiments across six mainstream LLMs show that the method improves over SFT by up to 14.9 percentage points, outperforms DPO by up to 4.3 percentage points, and matches the performance of the strong Best-of-N baseline. The core contribution is a novel, reward-guided, segment-wise decoding-time alignment mechanism that establishes a lightweight, real-time, and high-fidelity paradigm for value alignment.
📝 Abstract
Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
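The segment-level sample/score/accept-or-resample loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_segment`, `reward`, the acceptance rule, and the toy stand-ins are all hypothetical assumptions chosen to make the sketch self-contained and runnable.

```python
import random

def stars_decode(sample_segment, reward, prompt, num_segments=4,
                 max_tries=8, threshold=0.0):
    """Sketch of segment-level rejection sampling in the STARS style.

    sample_segment(context) -> one fixed-size segment (list of tokens)
    reward(context)         -> reward-model score for the partial output

    A candidate segment is accepted if it does not lower the running
    reward by more than `threshold`; otherwise it is resampled, keeping
    the best-scoring candidate as a fallback after `max_tries` attempts.
    """
    context = list(prompt)
    for _ in range(num_segments):
        base = reward(context)
        best_seg, best_score = None, float("-inf")
        for _ in range(max_tries):
            seg = sample_segment(context)
            score = reward(context + seg)
            if score > best_score:
                best_seg, best_score = seg, score
            if score >= base - threshold:
                # Accept early: this segment keeps the reward on track.
                break
        # If no candidate passed, fall back to the best one seen.
        context += best_seg
    return context

# Toy stand-ins (hypothetical): tokens are ints, segments have 3 tokens,
# and the "reward model" simply prefers a higher fraction of even tokens.
rng = random.Random(0)
toy_sample = lambda ctx: [rng.randrange(10) for _ in range(3)]
toy_reward = lambda ctx: sum(1 for t in ctx if t % 2 == 0) / max(len(ctx), 1)

out = stars_decode(toy_sample, toy_reward, prompt=[0], num_segments=3)
```

With real models, `sample_segment` would draw a fixed-length continuation from the LLM and `reward` would query the reward model on the partial sequence; the early-accept rule is one simple choice among several plausible acceptance criteria.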