ShiQ: Bringing back Bellman to LLMs

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Q-learning remains underexplored in reinforcement fine-tuning of large language models (LLMs), primarily due to the absence of a theoretically grounded loss on the logits and the empirical failure of direct logit updates. Method: This paper proposes ShiQ, a token-level, offline, off-policy Q-learning framework for LLMs, derived rigorously from the Bellman equation to define a logits-based Q-value loss. Key components include a Bellman-error-driven logits loss, Shifted-Q target modeling, and token-wise Q estimation. Results: On benchmarks including UltraFeedback and BFCL-V3, ShiQ outperforms mainstream methods such as PPO in both single-turn and multi-turn preference alignment tasks, achieving a 3.2× improvement in sample efficiency. It is presented as the first approach to unify theoretical rigor and engineering practicality for Q-learning in LLM reinforcement fine-tuning.
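The summary names three components: a Bellman-error-driven logits loss, Shifted-Q target modeling, and token-wise Q estimation. A minimal sketch of the general idea, assuming a soft-Bellman formulation in which the model's logits are read as token-level Q-values and values are bootstrapped against a frozen reference model (function names, the exact backup, and the reward placement are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def soft_value(logits, ref_logprobs, beta):
    # Soft state value: V(s) = beta * log sum_a pi_ref(a|s) * exp(logits(s, a) / beta)
    return beta * np.log(np.sum(np.exp(ref_logprobs + logits / beta)))

def token_bellman_loss(logits, ref_logprobs, actions, rewards, beta=1.0, gamma=1.0):
    """Mean squared soft-Bellman error over one token sequence.

    logits:       (T, V) model logits, read as token-level Q-value estimates
    ref_logprobs: (T, V) log-probabilities of a frozen reference model
    actions:      length-T sequence of sampled token ids
    rewards:      length-T sequence of per-token rewards
    """
    T = len(actions)
    loss = 0.0
    for t in range(T):
        q = logits[t, actions[t]]
        # Bootstrap from the next token's soft value; terminal value is 0.
        v_next = soft_value(logits[t + 1], ref_logprobs[t + 1], beta) if t + 1 < T else 0.0
        target = rewards[t] + gamma * v_next
        loss += (q - target) ** 2
    return loss / T
```

Because the target is built from logged tokens rather than fresh model samples, a loss of this shape can be minimized off-policy and offline, which is the sample-efficiency argument the summary makes.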

📝 Abstract
The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM, seen as an initial policy. Another RL paradigm, Q-learning, has received far less attention in the LLM community while demonstrating major success in various non-LLM RL tasks. In particular, Q-learning's effectiveness comes from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective due to the specificity of LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights from the RL literature to account for LLM-specific characteristics, ensuring that the logits become reliable Q-value estimates. We then use this loss to build a practical algorithm, ShiQ for Shifted-Q, that supports off-policy, token-wise learning while remaining simple to implement. Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback and BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings.

Problem

Research questions and friction points this paper is trying to address.

Adapt Q-learning to LLMs via Bellman equations
Improve sample efficiency in LLM reinforcement learning
Enable off-policy token-wise learning for LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Derives Bellman-based loss for Q-learning in LLMs
Adapts RL insights for LLM-specific characteristics
Implements ShiQ algorithm for off-policy token learning
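One LLM-specific fact that makes a logits-based Q-value reading plausible is that a softmax policy is invariant to adding a per-state constant to all logits, so the logits can absorb state-value offsets without changing the sampled distribution. A toy illustration of that invariance (the specific shift ShiQ applies is defined in the paper; this only demonstrates the underlying property):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; this itself is a constant shift.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])
shift = 3.7  # any state-dependent constant

# Shifting every logit by the same constant leaves the policy unchanged.
p_original = softmax(logits)
p_shifted = softmax(logits + shift)
assert np.allclose(p_original, p_shifted)
```

This is why two sets of logits differing only by per-state constants define the same policy while encoding different Q-value estimates.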