Cascade Reward Sampling for Efficient Decoding-Time Alignment

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Decoding-time alignment of large language models (LLMs) with human preferences suffers from computational inefficiency due to redundant token generation and frequent reward model (RM) invocations. To address this, the paper proposes Cascade Reward Sampling (CARDS), which exploits RMs' ability to score incomplete sequences. CARDS combines LLM predictive-uncertainty estimation with rejection sampling over dynamically sized semantic segments to construct high-reward, high-likelihood prefixes, which then guide subsequent generation. This enables prefix-level quality filtering during decoding and avoids wastefully expanding low-quality continuations. Experiments show that CARDS accelerates inference by roughly 5x while achieving a 99% win-tie rate against baselines in GPT-4 and Claude-3 helpfulness evaluations, substantially reducing redundant token regeneration and the number of RM queries, and effectively balancing alignment quality and decoding efficiency.

📝 Abstract
Aligning large language models (LLMs) with human preferences is critical for their deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play technique that requires no fine-tuning of model parameters. However, generating text that achieves both high reward and high likelihood remains a significant challenge. Existing methods often fail to generate high-reward text or incur substantial computational costs. In this paper, we propose Cascade Reward Sampling (CARDS) to address both issues, guaranteeing the generation of high-reward, high-likelihood text at significantly lower cost. Based on our analysis of reward models (RMs) on incomplete text and our observation that high-reward prefixes induce high-reward complete text, we use rejection sampling to iteratively generate small semantic segments that form such prefixes. The segment length is dynamically determined by the predictive uncertainty of the LLM. This strategy guarantees desirable prefixes for subsequent generation and significantly reduces both wasteful token re-generation and the number of reward-model scorings. Our experiments demonstrate substantial gains in both generation efficiency and alignment ratings compared to the baselines, achieving five times faster text generation and 99% win-ties in GPT-4/Claude-3 helpfulness evaluations.
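The decoding loop described in the abstract (iterative rejection sampling of small segments, accepting only those whose prefix clears a reward threshold) can be sketched as follows. This is a hedged illustration, not the authors' implementation: `sample_segment`, `reward`, and the fixed acceptance threshold are stand-in assumptions, where a real system would use an LLM to propose tokens, cut segments at uncertainty spikes, and score prefixes with a learned reward model.

```python
import random

random.seed(0)

def sample_segment(prefix):
    """Propose a short candidate segment (mock: random integer 'tokens').
    The random length stands in for the uncertainty-based cutoff."""
    length = random.randint(2, 5)
    return [random.randint(0, 9) for _ in range(length)]

def reward(prefix):
    """Mock reward model scoring an (incomplete) prefix: mean token value."""
    return sum(prefix) / max(len(prefix), 1)

def cards_decode(max_tokens=20, threshold=4.5, max_tries=50):
    """Segment-level rejection sampling: extend the prefix only with
    segments whose resulting prefix reward clears the threshold."""
    prefix = []
    while len(prefix) < max_tokens:
        for _ in range(max_tries):
            candidate = prefix + sample_segment(prefix)
            if reward(candidate) >= threshold:
                prefix = candidate  # accept high-reward prefix
                break
        else:
            break  # no segment cleared the threshold; stop early
    return prefix

out = cards_decode()
print(len(out), round(reward(out), 2))
```

The efficiency claim follows from the structure: rejected work is limited to one short segment at a time rather than a full continuation, and the RM is queried once per candidate segment instead of once per token.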
Problem

Research questions and friction points this paper is trying to address.

Improves efficiency of decoding-time alignment in LLMs
Reduces redundant token generation and reward evaluations
Enhances alignment quality and general utility of LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-level rejection sampling for efficiency
Uncertainty-based segmentation for accurate evaluations
Reduces decoding time by roughly 70%
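The "uncertainty-based segmentation" bullet above can be illustrated with a toy criterion: cut a new semantic segment wherever the next-token predictive entropy spikes above a threshold. This is an assumption about how such a criterion could look, the paper's exact rule may differ; the distributions and threshold `tau` are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def segment_boundaries(next_token_dists, tau=1.0):
    """Indices where predictive entropy exceeds tau: candidate cut points
    at which to end a segment and invoke the reward model."""
    return [i for i, dist in enumerate(next_token_dists) if entropy(dist) > tau]

# Toy distributions: confident steps stay inside a segment,
# high-entropy steps mark segment boundaries.
dists = [
    [0.9, 0.05, 0.05],   # confident
    [0.34, 0.33, 0.33],  # uncertain -> boundary
    [0.8, 0.1, 0.1],     # confident
    [0.4, 0.3, 0.3],     # uncertain -> boundary
]
print(segment_boundaries(dists))  # [1, 3]
```

Scoring only at these boundaries is what keeps the RM query count low: the reward model is invoked per segment, not per token.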