Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two key bottlenecks in on-policy distillation (OPD): insufficient student exploration of informative states and unreliable teacher supervision of student rollouts. It proposes Uni-OPD, a unified OPD framework that tackles both jointly. From the student's perspective, two data-balancing strategies promote exploration of informative student-generated states. From the teacher's perspective, an outcome-guided margin calibration mechanism restores order consistency between correct and incorrect trajectories, so that aggregated token-level guidance agrees with the outcome reward. The framework applies to both large language models (LLMs) and multimodal large language models (MLLMs) and is evaluated on 16 benchmarks spanning five domains, covering single- and multi-teacher settings, strong-to-weak transfer, and cross-modal distillation.

📝 Abstract

On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.
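
The order-consistency idea in the abstract can be illustrated with a small sketch. The abstract does not give the aggregation rule or the calibration formula, so everything below is an assumption for illustration only: the mean as the aggregator, the pairwise consistency measure, the additive shift, and all function names (`trajectory_score`, `order_consistency`, `calibrate_margin`) are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def trajectory_score(token_signals: Sequence[float]) -> float:
    """Aggregate token-level teacher guidance into one score per rollout.

    Assumption: the mean is used as the aggregator; the paper may use another.
    """
    return sum(token_signals) / len(token_signals)


def order_consistency(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Fraction of (correct, incorrect) pairs where the correct rollout scores higher."""
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    if not pos or not neg:
        return 1.0  # no pairs to compare, so nothing is out of order
    ordered = sum(1 for p in pos for n in neg if p > n)
    return ordered / (len(pos) * len(neg))


def calibrate_margin(
    scores: Sequence[float], outcomes: Sequence[int], margin: float
) -> List[float]:
    """Shift correct-rollout scores so the worst correct one beats the best
    incorrect one by at least `margin`.

    Assumption: a uniform additive shift stands in for the paper's mechanism.
    """
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    if not pos or not neg:
        return list(scores)
    gap = min(pos) - max(neg)
    shift = max(0.0, margin - gap)
    return [s + shift if o == 1 else s for s, o in zip(scores, outcomes)]


# Toy data: four student rollouts with made-up token-level signals
# and binary outcome rewards (1 = correct, 0 = incorrect).
rollouts = [[-0.25, -0.75], [-0.25, -0.25], [-1.0, -0.5], [-0.5, -0.5]]
outcomes = [1, 0, 0, 1]

scores = [trajectory_score(r) for r in rollouts]
before = order_consistency(scores, outcomes)
calibrated = calibrate_margin(scores, outcomes, margin=0.25)
after = order_consistency(calibrated, outcomes)
print(before, after)
```

In this toy data the second rollout is incorrect yet receives the highest aggregated score, so only half of the correct/incorrect pairs are ordered as the outcome reward would suggest. After the shift, every correct rollout outranks every incorrect one.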

Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation

informative states

teacher supervision

order consistency

student rollouts

Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Distillation

Dual-Perspective Optimization

Outcome-Guided Calibration