🤖 AI Summary
This study addresses the "Lost in Conversation" (LiC) problem in large language models, where intermediate assistant responses contaminate subsequent dialogue context, leading to degraded performance. The work identifies self-contamination as the root cause of LiC and introduces MAIGO—a training-stage-only solution that requires no verifier rewards, state labels, or inference-time assistance. MAIGO employs an intra-policy self-distillation mechanism: during intermediate turns, it removes prior assistant responses while preserving the user-visible prefix; during answer turns, it distills knowledge using the full user dialogue history. Additionally, reliability-weighted sampling is integrated to suppress noisy samples. Evaluated on Qwen2.5-7B-Instruct, MAIGO improves SHARDED accuracy from 52.8% to 66.1% and increases the SHARDED/FULL accuracy ratio from 66.5% to 84.1%, with no more than a 2.3-percentage-point drop in FULL accuracy.
📝 Abstract
Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.