MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

📅 2026-05-02
🤖 AI Summary
This work addresses three limitations of single-teacher on-policy distillation: the student is capped by the teacher's capability ceiling, it inherits the teacher's errors, and in agentic tasks per-step errors compound across long trajectories and destabilize training. The authors propose MAD-OPD, a multi-agent debate-driven distillation framework in which a collective of teachers debates over the student's on-policy state and supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. On-Policy Agentic Distillation (OPAD) adds step-level sampling to stabilize training on agentic tasks. A task-adaptive divergence principle selects Jensen–Shannon divergence for agentic tasks and reverse KL divergence for code generation. Evaluated across six teacher–student configurations and five agentic and code-generation benchmarks, MAD-OPD ranks first in all six configurations; in the 14B+8B→4B setting it improves the agentic average by 2.4% and the code average by 3.7% over the stronger single-teacher OPD baseline.
📝 Abstract
On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4\%$ and the code average by $+3.7\%$ over the stronger single-teacher OPD.
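The abstract's two core ideas, confidence-weighted supervision from several teachers and a task-dependent choice of divergence, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear mixture with normalized confidences, the function names (`mix_teachers`, `token_loss`), and the `task` string are all assumptions made for illustration, since the abstract does not give the exact aggregation rule or the debate procedure.

```python
import math

def mix_teachers(teacher_probs, confidences):
    """Assumed aggregation: confidence-weighted average of the teachers'
    next-token distributions (the paper's exact rule is not given here)."""
    total = sum(confidences)
    weights = [c / total for c in confidences]
    vocab = len(teacher_probs[0])
    return [sum(w * p[v] for w, p in zip(weights, teacher_probs))
            for v in range(vocab)]

def kl(p, q, eps=1e-12):
    """KL(p || q) over a discrete vocabulary; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(student, teacher):
    """Reverse KL as used in distillation: KL(student || teacher)."""
    return kl(student, teacher)

def jsd(student, teacher):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    mid = [(s + t) / 2 for s, t in zip(student, teacher)]
    return 0.5 * kl(student, mid) + 0.5 * kl(teacher, mid)

def token_loss(student, teacher_probs, confidences, task):
    """Per-token loss: JSD for agentic tasks, reverse KL otherwise (code)."""
    target = mix_teachers(teacher_probs, confidences)
    if task == "agentic":  # hypothetical task label
        return jsd(student, target)
    return reverse_kl(student, target)
```

For example, `token_loss([0.6, 0.4], [[0.9, 0.1], [0.5, 0.5]], [0.8, 0.2], "agentic")` scores a student distribution against two hypothetical teachers whose post-debate confidences are 0.8 and 0.2. The boundedness of JSD is one plausible reason it suits long agentic trajectories, while reverse KL is mode-seeking; the paper's own justification should be consulted for the actual argument.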
Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation
single-teacher ceiling
error compounding
agentic tasks
training instability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Debate
On-Policy Distillation
Collective Intelligence
Task-Adaptive Divergence
Agentic Distillation
🔎 Similar Papers
Jianze Wang
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Alibaba Group
Ying Liu
Alibaba Group
Jinlong Chen
Alibaba Group
Xuchun Hu
Alibaba Group
Qilong Zhang
ByteDance
adversarial examples, blind watermark, LLM, Post-Training
Yu Cao
Alibaba, ex-Huawei, University of Edinburgh
RLHF, Robotics, AI
Jun Wang
Alibaba Group
Hua Yang
Redrock Biometrics
Biometrics, Motion Tracking, Computer Vision, Augmented Reality, Image Processing
Yong Xie
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Qianglong Chen
Alibaba Group