Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

📅 2026-05-28
🤖 AI Summary
This work addresses a failure mode of language models in multi-turn dialogue: unsupported assumptions made in early turns, under partial information, pull later responses away from the answer the model would give with the full context. It proposes Canonical-Context On-Policy Distillation (CCOPD), which treats the model's own full-context behavior as the learning target for multi-turn dialogue. CCOPD uses the same base model in two roles: a frozen teacher conditioned on the complete (FULL) prompt, and a trainable student that receives the same evidence incrementally and is conditioned on its own multi-turn trajectory. On-policy distillation then aligns the student's behavior on those trajectories with the teacher's full-context behavior. Trained only on math problem conversations, the method mitigates self-anchored drift and yields a 32% average relative improvement in RAW-SHARDED performance over the base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance.
📝 Abstract
Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.
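The teacher/student setup described in the abstract can be illustrated with a minimal sketch. The abstract does not say which divergence CCOPD minimizes, so this example assumes a per-token reverse KL, KL(student || teacher), which is a common choice in on-policy distillation. The function names (`log_softmax`, `reverse_kl`, `distillation_loss`) and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one next-token distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single token position.

    ASSUMPTION: reverse KL is used here for illustration only;
    the source does not specify the divergence.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def distillation_loss(student_step_logits, teacher_step_logits):
    """Mean per-token divergence along one student-sampled response.

    student_step_logits[i]: student logits at position i, conditioned on
        the sharded multi-turn conversation plus its own earlier tokens.
    teacher_step_logits[i]: frozen teacher logits at the same position,
        conditioned on the FULL prompt plus the same student tokens.
    Both are hypothetical inputs; a real pipeline would obtain them from
    two forward passes of the same base model.
    """
    if not student_step_logits:
        raise ValueError("response must contain at least one token")
    if len(student_step_logits) != len(teacher_step_logits):
        raise ValueError("student and teacher must score the same tokens")
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_step_logits, teacher_step_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example over a 3-token vocabulary and a 2-token response.
student = [[1.0, 2.0, 0.5], [0.1, 0.2, 0.3]]
teacher = [[2.0, 1.0, 0.5], [0.3, 0.2, 0.1]]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student's next-token distributions match the teacher's at every position and positive otherwise. Because the response tokens are sampled from the student, the training signal is computed on the trajectories the student actually produces, which is what makes the distillation on-policy.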
Problem

Research questions and friction points this paper is trying to address.

multi-turn language models
self-anchored drift
context fragmentation
response consistency
evidence grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Canonical-Context On-Policy Distillation
self-anchored drift
multi-turn language models
on-policy distillation
context grounding