🤖 AI Summary
A distributional mismatch exists between standard single-turn RLHF training and the multi-turn interactive settings in which LLMs are actually deployed, raising the question of whether multi-turn RLHF is necessary, or even beneficial, for improving reasoning capabilities. Method: the authors systematically compare single-turn RLHF against three distinct multi-turn RLHF strategies on reasoning benchmarks, evaluating performance under both single-turn and multi-turn inference. Contribution/Results: contrary to prevailing assumptions, models trained with single-turn RLHF outperform those trained with multi-turn RLHF across both evaluation settings, showing stronger generalization and stability, while multi-turn training consistently degrades reasoning performance rather than enhancing it. These findings challenge the "more interaction is better" hypothesis: in full-information settings, multi-turn feedback supervision provides negligible benefit and can actively impair reasoning. The results offer empirical evidence for reevaluating current RLHF paradigms for reasoning-oriented LLM alignment.
📝 Abstract
The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, creating a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefit and can even degrade reasoning capabilities.
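The train/deploy mismatch the abstract describes can be illustrated with a toy evaluation harness. This is a minimal sketch, not the paper's actual setup: `solve` stands in for an LLM policy, the answer check stands in for a verifier, and the "basic feedback" is a simple retry prompt, mirroring the single-turn (one attempt) versus multi-turn (attempt, feedback, retry) inference settings the paper compares.

```python
import random

def solve(question: str, history: list[str]) -> str:
    """Stand-in for an LLM policy: produces an answer given the question
    and any prior feedback turns. Here it just guesses uniformly."""
    return random.choice(["A", "B", "C"])

def single_turn_eval(questions: list[str], answers: list[str]) -> float:
    """One attempt per question: the setting standard RLHF training assumes."""
    correct = sum(solve(q, []) == a for q, a in zip(questions, answers))
    return correct / len(questions)

def multi_turn_eval(questions: list[str], answers: list[str],
                    max_turns: int = 3) -> float:
    """Up to max_turns attempts; after each wrong answer the model receives
    basic feedback, mimicking interactive multi-turn deployment."""
    correct = 0
    for q, a in zip(questions, answers):
        history: list[str] = []
        for _ in range(max_turns):
            guess = solve(q, history)
            if guess == a:
                correct += 1
                break
            history.append(f"'{guess}' is incorrect; try again.")
    return correct / len(questions)
```

For this random toy solver, extra feedback turns mechanically raise accuracy; the paper's finding is that for a trained model on full-information reasoning tasks, supervising this multi-turn loop during training does not help and hurts single-turn performance.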