Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inference overhead of aligning large language models (LLMs) with human preferences at test time, this paper proposes reward-Shifted Speculative Sampling (SSS). SSS requires no additional training: it exploits the distributional shift between a small, aligned draft model and a large, unaligned target model, modifying the acceptance criterion and the bonus-token distribution of speculative decoding to enable efficient "weak-to-strong" alignment at inference time. Its core innovation is embedding human preference signals directly into the speculative sampling mechanism, so the aligned draft model steers generation toward outputs with higher preference-aligned reward while the target model remains unchanged. Experiments across multiple benchmarks show that SSS reduces inference latency by up to 2.3× while improving gold reward scores by 4.2%, outperforming both standard test-time alignment baselines and existing speculative sampling methods.

📝 Abstract
Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
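The accept/resample step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact criterion: it assumes the RLHF-optimal distribution has the standard form pi*(x) ∝ p(x)·exp(r(x)/β), and `reward`, `beta`, and the toy vocabulary are hypothetical stand-ins for the paper's token-level quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

def shifted_speculative_step(p_target, q_draft, reward, beta=1.0):
    """One illustrative shifted accept/reject step over a toy vocabulary.

    p_target: unaligned target-model probabilities over the vocab
    q_draft:  aligned draft-model probabilities over the vocab
    reward:   hypothetical token-level reward estimates
    The implied RLHF-optimal distribution is
    pi*(x) ∝ p_target(x) * exp(reward(x) / beta).
    """
    pi = p_target * np.exp(reward / beta)
    pi /= pi.sum()

    # Draft model proposes a token.
    x = rng.choice(len(q_draft), p=q_draft)

    # Shifted acceptance criterion: accept with prob min(1, pi(x)/q(x)),
    # so accepted tokens are distributed according to pi*, not p_target.
    if rng.random() < min(1.0, pi[x] / q_draft[x]):
        return x

    # On rejection, resample from the residual max(0, pi - q), renormalized;
    # this is the standard speculative-sampling correction applied to pi*.
    residual = np.maximum(pi - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)
```

By the usual speculative-sampling argument, accepting with min(1, pi/q) and resampling rejections from the normalized residual yields samples exactly from pi*, so no extra training of the target model is needed.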
Problem

Research questions and friction points this paper is trying to address.

Reducing high inference costs of test-time alignment techniques
Aligning large language models with human preferences efficiently
Improving weak-to-strong alignment without modifying target model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-Shifted Speculative Sampling algorithm
Aligned draft model with unaligned target
Modified acceptance and bonus token distribution