Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning

📅 2025-10-24
🤖 AI Summary
A distributional mismatch arises between standard single-turn RLHF training and real-world multi-turn interactive deployment of LLMs, raising the question of whether multi-turn RLHF is necessary, or even beneficial, for improving reasoning capabilities. Method: We systematically compare single-turn RLHF against three distinct multi-turn RLHF strategies on rigorous reasoning benchmarks, evaluating both single-turn and multi-turn inference performance. Contribution/Results: Contrary to prevailing assumptions, models trained with single-turn RLHF significantly outperform those trained with multi-turn RLHF across both single-turn and multi-turn evaluations, demonstrating superior generalization and stability. Multi-turn training fails to enhance reasoning performance and consistently degrades it. These findings challenge the "more interaction is better" hypothesis, revealing that multi-turn feedback supervision provides negligible benefit, and may actively impair reasoning, in full-information settings. Our results offer empirical evidence urging a reevaluation of current RLHF paradigms for reasoning-oriented LLM alignment.

📝 Abstract
The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Is multi-turn human feedback necessary for training LLMs on reasoning tasks?
How effective is single-turn training compared with multi-turn training strategies?
Does single-turn training generalize better than multi-turn approaches?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows that single-turn training outperforms three multi-turn training strategies
Shows that multi-turn training degrades single-turn reasoning performance
Shows that basic human feedback provides limited benefit for reasoning
Authors
Qiang Liu, Tencent Interactive Entertainment, Shenzhen, China
Wuganjing Song, Tencent Interactive Entertainment, Shenzhen, China
Zhenzhou Lin, Tencent Interactive Entertainment, Shenzhen, China
Feifan Chen, Tencent Interactive Entertainment, Shenzhen, China
Qiaolong Cai, Tencent Interactive Entertainment, Shenzhen, China
Chen Li, Tencent Interactive Entertainment, Shenzhen, China
Yongduo Sui, Tencent
Tags: LLM, Agent, Graph Learning, Recommendation