Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

📅 2024-10-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Large language models (LLMs) face covariate shift and poor policy generalization in long-horizon tasks such as multi-turn dialogue: conventional multi-turn RLHF relies on a fixed reference policy to generate the historical context, inducing a distributional mismatch between training and real-world interaction. This work proposes REFUEL, a framework that reformulates multi-turn RLHF as an iterative Q-value regression problem. REFUEL employs a single model for both policy evaluation and optimization, eliminating the need for explicit context modeling or multi-policy coordination; instead, it constructs iterative training datasets from self-generated trajectories. Theoretically, REFUEL is proven to match the performance of any policy covered by the training data. Experiments on Llama-3-8B-instruct demonstrate that REFUEL-finetuned models outperform Llama-3.1-70B-instruct in long-horizon multi-turn dialogue and significantly surpass DPO and REBEL baselines.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI.
Problem

Research questions and friction points this paper is trying to address.

Addresses covariate shift in multi-turn RLHF for LLMs
Improves long-term planning in multi-turn dialogue tasks
Enhances performance with self-generated data and Q-value estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single model estimates Q-values efficiently
Trains on self-generated data to reduce shift
Frames RLHF as iterative regression tasks
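The pairwise regression idea in the bullets above can be illustrated with a minimal sketch. The snippet below is a toy, hedged rendering of the core objective (not the paper's implementation): for two self-generated rollouts from the same dialogue prefix, the change in the policy's log-probability ratios is regressed onto the difference in their relative-future rewards (Q-value estimates). All function and argument names here are hypothetical.

```python
def refuel_pair_loss(logp_new_a, logp_old_a,
                     logp_new_b, logp_old_b,
                     q_a, q_b, eta=1.0):
    """Toy pairwise regression loss in the spirit of REFUEL.

    For two rollouts (a, b) sampled from the same prefix by the current
    policy, regress the difference of log-probability ratios onto the
    difference of their estimated Q-values (relative future rewards).
    Hypothetical signature; illustrates the idea, not the paper's code.
    """
    # Log-ratio of new vs. sampling policy for each rollout.
    ratio_a = logp_new_a - logp_old_a
    ratio_b = logp_new_b - logp_old_b
    # Squared error between scaled ratio gap and Q-value gap.
    return ((1.0 / eta) * (ratio_a - ratio_b) - (q_a - q_b)) ** 2


# The loss vanishes when the policy's ratio gap matches the Q-value gap,
# and is positive otherwise.
zero_case = refuel_pair_loss(1.0, 0.0, 0.0, 0.0, q_a=1.0, q_b=0.0)
off_case = refuel_pair_loss(0.0, 0.0, 0.0, 0.0, q_a=1.0, q_b=0.0)
```

Iterating this regression on freshly self-generated rollout pairs is what lets a single model serve as both the Q-value estimator and the policy, avoiding a fixed reference policy for past turns.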