Streaming Looking Ahead with Token-level Self-reward

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the suboptimal decision-making in autoregressive streaming generation—caused by exclusive reliance on historical context—this paper proposes a real-time lookahead mechanism that operates without external reward models. Our method builds upon the Reward Transformer architecture and integrates seamlessly with reinforcement fine-tuning techniques such as DPO. Key contributions include: (1) token-level self-reward modeling (TRM), a lightweight, differentiable reward predictor embedded within the policy model; and (2) streaming lookahead (SLA), an efficient parallel search algorithm enabling low-latency responses. Experiments demonstrate that, with a frozen policy model, SLA achieves a 79.7% win rate over greedy decoding; integrating DPO further improves this to 89.4%. These results significantly outperform conventional MCTS-based approaches while preserving end-to-end streaming latency constraints.

Technology Category

Application Category

📝 Abstract
Autoregressive decoding algorithms that use only past information often cannot guarantee the best performance. Recently, people discovered that looking-ahead algorithms such as Monte Carlo Tree Search (MCTS) with external reward models (RMs) can significantly improve models' output by allowing them to think ahead and leverage future outputs and associated rewards to guide the current generation. Such techniques can help the reinforcement fine-tuning phase by sampling better trajectories and the inference phase by selecting the better output. However, their high computational cost limits their applications, especially in streaming scenarios. To address this issue, we propose equipping the policy model with token-level self-reward modeling (TRM) capability to eliminate the need for external models and extra communication. We name the new architecture as Reward Transformer. In addition, we propose a streaming-looking-ahead (SLA) algorithm to further boost search efficiency with better parallelization. Experiments show that SLA achieves an overall win rate of 79.7% against the baseline greedy decoding algorithm on three general-domain datasets with a frozen policy model while maintaining streaming efficiency. If we combine SLA with reinforcement fine-tuning techniques such as DPO, SLA achieves an overall win rate of 89.4%.
Problem

Research questions and friction points this paper is trying to address.

Improve autoregressive decoding with future token rewards
Reduce computational cost in streaming scenarios
Enhance model output quality and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level self-reward modeling (TRM)
Streaming-looking-ahead (SLA) algorithm
Reward Transformer architecture
🔎 Similar Papers
No similar papers found.
H
Hongming Zhang
Tencent AI Lab, Seattle
Ruixin Hong
Ruixin Hong
Tsinghua University
Reasoning with LLM
D
Dong Yu
Tencent AI Lab, Seattle