🤖 AI Summary
A distributional mismatch exists between standard single-turn RLHF training and the multi-turn interactive settings in which LLMs are actually deployed, raising the question of whether multi-turn RLHF is necessary, or even beneficial, for improving reasoning capabilities. Method: the authors systematically compare single-turn RLHF against three distinct multi-turn RLHF strategies on reasoning benchmarks, evaluating performance under both single-turn and multi-turn inference. Contribution/Results: contrary to prevailing assumptions, models trained with single-turn RLHF outperform those trained with multi-turn RLHF across both evaluation settings, showing stronger generalization and stability, while multi-turn training consistently degrades reasoning performance rather than enhancing it. These findings challenge the "more interaction is better" hypothesis: in full-information settings, multi-turn feedback supervision provides negligible benefit and can actively impair reasoning. The results offer empirical evidence for reevaluating current RLHF paradigms for reasoning-oriented LLM alignment.
📝 Abstract
The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, creating a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefit and can even degrade reasoning capabilities.
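The train/deploy mismatch the abstract describes can be illustrated with a toy evaluation harness. This is a minimal sketch, not the paper's actual setup: `solve` stands in for an LLM policy, the answer check stands in for a verifier, and the "basic feedback" is a simple retry prompt, mirroring the single-turn (one attempt) versus multi-turn (attempt, feedback, retry) inference settings the paper compares.

```python
import random

def solve(question: str, history: list[str]) -> str:
    """Stand-in for an LLM policy: produces an answer given the question
    and any prior feedback turns. Here it just guesses uniformly."""
    return random.choice(["A", "B", "C"])

def single_turn_eval(questions: list[str], answers: list[str]) -> float:
    """One attempt per question: the setting standard RLHF training assumes."""
    correct = sum(solve(q, []) == a for q, a in zip(questions, answers))
    return correct / len(questions)

def multi_turn_eval(questions: list[str], answers: list[str],
                    max_turns: int = 3) -> float:
    """Up to max_turns attempts; after each wrong answer the model receives
    basic feedback, mimicking interactive multi-turn deployment."""
    correct = 0
    for q, a in zip(questions, answers):
        history: list[str] = []
        for _ in range(max_turns):
            guess = solve(q, history)
            if guess == a:
                correct += 1
                break
            history.append(f"'{guess}' is incorrect; try again.")
    return correct / len(questions)
```

For this random toy solver, extra feedback turns mechanically raise accuracy; the paper's finding is that for a trained model on full-information reasoning tasks, supervising this multi-turn loop during training does not help and hurts single-turn performance.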