LLMs Get Lost In Multi-Turn Conversation

📅 2025-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) perform markedly worse in multi-turn dialogues than in single-turn ones, with an average drop of 39%. The decline stems less from a loss of underlying capability than from assumptions the models commit to in early turns, after which they deviate and do not recover. Method: Using more than 200,000 controlled simulated dialogues, we decompose the multi-turn degradation into a minor loss in aptitude and a large increase in unreliability, and we validate the findings on six generation tasks across a range of open- and closed-weight LLMs. Contribution/Results: We propose a multi-turn reliability benchmark oriented toward dialogue robustness that helps localize failure modes. The framework reveals unreliability in every LLM tested and provides evaluation tools for improving LLM dialogue stability.

📝 Abstract
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.

Problem

Research questions and friction points this paper is trying to address.

LLMs perform worse in multi-turn conversations than in single-turn ones
LLMs lose reliability when handling underspecified user instructions
LLMs make premature assumptions and fail to recover from errors

Innovation

Methods, ideas, or system contributions that make the work stand out.

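Large-scale simulation experiments compare LLM performance in single- and multi-turn settings across six generation tasks
Analysis of 200,000+ simulated conversations decomposes the performance drop into a minor loss in aptitude and a significant increase in unreliability
The analysis shows that LLMs make assumptions in early turns, attempt final solutions prematurely, and then rely on them too heavily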
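
The split between aptitude and unreliability can be illustrated with a small sketch. This page does not give the exact formulas, so the definitions here are assumptions: aptitude is taken as the 90th-percentile score over repeated simulations of the same task, and unreliability as the gap between the 90th and 10th percentiles. The function names (`percentile`, `summarize_runs`, `relative_drop`) and the scores are hypothetical and serve only as illustration.

```python
from statistics import mean


def percentile(scores, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def summarize_runs(scores):
    """Summarize repeated runs of one task (assumed definitions, see lead-in)."""
    p10 = percentile(scores, 10)
    p90 = percentile(scores, 90)
    return {
        "mean": mean(scores),
        "aptitude": p90,             # assumed: best-case behaviour
        "unreliability": p90 - p10,  # assumed: spread between good and bad runs
    }


def relative_drop(single_turn_score, multi_turn_score):
    """Fractional performance drop when moving from single- to multi-turn."""
    return (single_turn_score - multi_turn_score) / single_turn_score


if __name__ == "__main__":
    # Hypothetical scores for the same task, simulated repeatedly.
    single_turn = [88, 90, 91, 89, 92, 90, 87, 93, 90, 91]
    multi_turn = [20, 85, 40, 88, 15, 60, 90, 35, 70, 55]
    print("single-turn:", summarize_runs(single_turn))
    print("multi-turn: ", summarize_runs(multi_turn))
    print("relative drop in mean:",
          relative_drop(mean(single_turn), mean(multi_turn)))
```

Under these assumed definitions, a model whose best runs stay strong while its worst runs collapse shows nearly unchanged aptitude but much higher unreliability, which is the pattern the abstract describes.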