MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

career value

148K/year

🤖 AI Summary

This study addresses the "Lost in Conversation" (LiC) problem in large language models, where intermediate assistant responses contaminate subsequent dialogue context, leading to degraded performance. The work identifies self-contamination as the root cause of LiC and introduces MAIGO—a training-stage-only solution that requires no verifier rewards, state labels, or inference-time assistance. MAIGO employs an intra-policy self-distillation mechanism: during intermediate turns, it removes prior assistant responses while preserving the user-visible prefix; during answer turns, it distills knowledge using the full user dialogue history. Additionally, reliability-weighted sampling is integrated to suppress noisy samples. Evaluated on Qwen2.5-7B-Instruct, MAIGO improves SHARDED accuracy from 52.8% to 66.1% and increases the SHARDED/FULL accuracy ratio from 66.5% to 84.1%, with no more than a 2.3-percentage-point drop in FULL accuracy.

📝 Abstract

Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.

Problem

Research questions and friction points this paper is trying to address.

lost-in-conversation

self-contamination

multi-turn dialogue

large language models

context degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

lost-in-conversation

self-contamination

on-policy self-distillation