Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

📅 2024-10-11
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Large language models (LLMs) underperform on multi-step reasoning tasks such as mathematical problem solving because existing alignment methods provide inadequate process-level supervision and impose bandit-style structural constraints. Method: The paper formulates response generation as a Markov decision process (MDP) and proposes an offline reinforcement learning framework based on Soft Actor-Critic (SAC). Crucially, it parameterizes the Q-function directly with the LLM itself, sidestepping the architectural limitations of bandit-style approaches and enabling fine-grained, stepwise supervision over reasoning trajectories without online sampling. Contribution/Results: The method reduces computational cost while outperforming mainstream alignment techniques, including PPO and DPO, on the GSM8K and MATH benchmarks, establishing an efficient, low-overhead, and structurally grounded paradigm for aligning LLMs on multi-step reasoning.

📝 Abstract
Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks, such as math problem solving and complex reasoning that involve long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
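To make the abstract's SAC framing concrete, here is a minimal numerical sketch of the soft Bellman machinery it alludes to. This is not the authors' implementation: the temperature `beta`, the toy Q-values, and the helper names are illustrative assumptions. In the soft-RL view, per-token scores play the role of a Q-function, the soft state value is a temperature-scaled log-sum-exp of the Q-values, the induced policy is a softmax of Q over the temperature, and offline training minimizes the squared soft Bellman residual along recorded trajectories.

```python
import numpy as np

BETA = 1.0  # entropy temperature (illustrative choice, not from the paper)

def soft_value(q_row, beta=BETA):
    """Soft state value V(s) = beta * log sum_a exp(Q(s, a) / beta),
    computed with the standard max-shift for numerical stability."""
    m = q_row.max()
    return m + beta * np.log(np.exp((q_row - m) / beta).sum())

def soft_policy(q_row, beta=BETA):
    """Induced policy pi(a|s) = exp((Q(s, a) - V(s)) / beta),
    i.e. a softmax over Q(s, .) / beta."""
    return np.exp((q_row - soft_value(q_row, beta)) / beta)

def soft_bellman_loss(q_sa, rewards, next_values, gamma=1.0):
    """Mean squared soft Bellman residual over an offline trajectory:
    delta_t = Q(s_t, a_t) - (r_t + gamma * V(s_{t+1}))."""
    deltas = q_sa - (rewards + gamma * next_values)
    return float(np.mean(deltas ** 2))

# Toy example: a state with two actions of equal Q-value.
q = np.array([0.0, 0.0])
print(round(soft_value(q), 4))  # log(2) rounded: 0.6931
print(soft_policy(q))           # uniform: [0.5 0.5]
```

Because `soft_policy` is just a softmax of the Q-values, parameterizing Q with the language model's own scores keeps generation and value estimation in a single network, which is the structural point the abstract makes against bandit-style formulations.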
Problem

Research questions and friction points this paper is trying to address.

Optimize multi-step reasoning in language models
Reduce computational resources in model training
Improve performance in math problem-solving tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct Q-function Optimization
Markov Decision Process
Soft Actor-Critic framework