MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three limitations of single-teacher on-policy distillation: the student is capped by the teacher's capability ceiling, it inherits the teacher's errors, and in agentic tasks per-step errors compound across long trajectories and destabilize training. The authors propose MAD-OPD, a multi-agent debate-driven distillation framework in which a collective of teachers debates over the student's on-policy state and supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend the approach to agentic tasks, they add On-Policy Agentic Distillation (OPAD), which uses step-level sampling to stabilize training. They also derive a task-adaptive divergence principle that selects Jensen–Shannon divergence for agentic tasks and reverse KL divergence for code generation. Evaluated across six teacher–student configurations and five agentic and code-generation benchmarks, MAD-OPD ranks first in all six configurations; in the 14B+8B→4B setting it improves the agentic average by 2.4% and the code average by 3.7% over the stronger single-teacher OPD baseline.

📝 Abstract

On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4\%$ and the code average by $+3.7\%$ over the stronger single-teacher OPD.
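The two token-level ideas in the abstract, confidence-weighted teacher supervision and a task-dependent divergence, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact weighting formula, so the normalized linear mixture below is an assumption, and the names `collective_teacher`, `token_loss`, and the `task` strings are hypothetical. Only the standard definitions of reverse KL (student relative to teacher) and Jensen-Shannon divergence are used as-is.

```python
import numpy as np


def collective_teacher(teacher_probs, confidences):
    """Mix per-teacher next-token distributions into one target.

    teacher_probs: array of shape (n_teachers, vocab), each row sums to 1.
    confidences:   non-negative post-debate confidence per teacher.
    ASSUMPTION: weights are the confidences normalized to sum to 1.
    """
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(teacher_probs, dtype=float)


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped for numerical safety."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


def reverse_kl(student, teacher):
    """Reverse KL in the distillation sense: KL(student || teacher)."""
    return kl(student, teacher)


def jsd(student, teacher):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = 0.5 * (np.asarray(student, dtype=float) + np.asarray(teacher, dtype=float))
    return 0.5 * kl(student, m) + 0.5 * kl(teacher, m)


def token_loss(student, teacher, task):
    """Hypothetical task switch: JSD for agentic tasks, reverse KL for code."""
    if task == "agentic":
        return jsd(student, teacher)
    if task == "code":
        return reverse_kl(student, teacher)
    raise ValueError(f"unknown task: {task!r}")


# Two teachers over a 3-token vocabulary; the first is three times as confident.
teachers = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])
target = collective_teacher(teachers, confidences=[3.0, 1.0])
student = np.array([0.4, 0.4, 0.2])

agentic_loss = token_loss(student, target, task="agentic")
code_loss = token_loss(student, target, task="code")
```

In this toy example the mixed target is `[0.55, 0.35, 0.10]`. Because JSD is bounded, a single token where the student and the target strongly disagree cannot dominate the loss, which fits the abstract's framing of JSD as the choice for agentic stability; reverse KL has no such bound.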

Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation

single-teacher ceiling

error compounding

agentic tasks

training instability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Debate

On-Policy Distillation

Collective Intelligence