🤖 AI Summary
This work addresses the significant performance degradation of large language models (LLMs) in multi-turn dialogues—commonly referred to as the "Lost in Conversation" phenomenon—which arises from misalignment between user intent and model interpretation. The study identifies the root cause not as a limitation of model capability, but as an intent mismatch embedded in the structure of the interaction itself. To resolve this, the authors propose the Mediator-Assistant architecture, which decouples intent understanding from task execution. Specifically, an experience-driven Mediator module explicitly models user intent from the dialogue history and transforms ambiguous inputs into structured instructions for the Assistant. Experimental results show that this approach effectively mitigates performance decay across multiple LLMs in multi-turn settings, confirming both its efficacy and its generalizability.
📝 Abstract
Multi-turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow-up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research reveals that LLMs suffer a substantial performance drop in multi-turn settings compared to single-turn interactions with fully specified instructions, a phenomenon termed "Lost in Conversation" (LiC). While this prior work attributes LiC to model unreliability, we argue that the root cause lies in an intent alignment gap rather than intrinsic capability deficits. In this paper, we first demonstrate that LiC is not a failure of model capability but rather a breakdown in interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in conversational context rather than representational limitations. To address this, we propose to decouple intent understanding from task execution through a Mediator-Assistant architecture. By utilizing an experience-driven Mediator to explicate user inputs into explicit, well-structured instructions based on historical interaction patterns, our approach effectively bridges the gap between vague user intent and model interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi-turn conversations across diverse LLMs.
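The decoupling the abstract describes can be sketched in miniature. Everything below—the class names, the shape of the `Instruction` object, and the rule-based stub logic—is an illustrative assumption, not the paper's actual implementation; in a real system both components would wrap LLM calls, with the Mediator prompted to rewrite ambiguous follow-ups into fully specified instructions before the Assistant ever sees them.

```python
# Hypothetical sketch of a Mediator-Assistant pipeline (names and structure
# are assumptions for illustration; real components would be LLM-backed).
from dataclasses import dataclass, field


@dataclass
class Instruction:
    """An explicit, well-structured instruction produced by the Mediator."""
    task: str                                   # the underlying task, made explicit
    constraints: list[str] = field(default_factory=list)


class Mediator:
    """Models user intent from dialogue history and rewrites ambiguous
    follow-up turns into fully specified instructions."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def explicate(self, user_turn: str) -> Instruction:
        self.history.append(user_turn)
        # Stub logic: carry the full accumulated intent forward, so each
        # instruction is self-contained rather than an ambiguous fragment.
        return Instruction(task=self.history[0], constraints=self.history[1:])


class Assistant:
    """Executes fully specified instructions; never sees raw ambiguous turns."""

    def execute(self, instr: Instruction) -> str:
        parts = [instr.task] + [f"subject to: {c}" for c in instr.constraints]
        return "; ".join(parts)


mediator, assistant = Mediator(), Assistant()
assistant.execute(mediator.explicate("Write a sorting function"))
reply = assistant.execute(mediator.explicate("make it stable"))
print(reply)
```

The point of the sketch is the interface boundary: the follow-up "make it stable" reaches the Assistant already grounded in the original task, which is the structural fix the paper proposes for the intent alignment gap.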