Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of imitation learning from offline demonstrations: the absence of online corrective signals, and constrained generalization and exploration. It introduces FA-OPD, which combines a flow-matching teacher model with adversarial dual-channel on-policy distillation. A reward channel provides a long-horizon expert-similarity objective that encourages exploration, while an action channel delivers dense, localized supervision that stabilizes policy execution. Evaluated on six benchmark tasks spanning robotic navigation, manipulation, and locomotion, FA-OPD outperforms strong existing baselines and is more robust under noisy or sparse demonstrations.
📝 Abstract
Learning from demonstrations in embodied control is often cast as behavioral cloning, and recent diffusion or flow-matching policies improve this paradigm by modeling multi-modal expert actions. Yet these methods remain offline supervised learners: the policy is trained only on expert states and receives no corrective signal on the states it actually visits. On-policy distillation (OPD) offers a natural remedy, but standard OPD assumes a strong fixed teacher, which is unavailable in demonstration-only control. We propose FA-OPD, an adversarial dual on-policy distillation method in which a Flow Matching (FM) teacher is learned from demonstrations and co-trained with a lightweight MLP student. The teacher provides two complementary signals on student rollouts. The reward channel learns an expert-likeness objective over state-action pairs and drives online exploration through long-horizon policy optimization. The action channel supplies dense local targets at student-visited states, stabilizing exploitation. FA-OPD couples them so that reward distillation enables generalization beyond point-wise demonstrations, while action distillation keeps exploration anchored near expert-like behavior. Across six robot navigation, manipulation, and locomotion benchmarks, FA-OPD beats strong baselines and shows much stronger robustness under noisy or limited demonstrations.
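The two channels described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the loss forms, so everything concrete here is an assumption made for illustration, including the Euler sampler for the flow-matching teacher, the GAIL-style reward computed from a discriminator logit, the squared-error action loss, the weight `beta`, and all function names.

```python
import numpy as np

def fm_teacher_action(velocity_fn, state, action_dim, rng, n_steps=10):
    """Sample a teacher action by Euler-integrating a velocity field
    from Gaussian noise (t=0) toward the data distribution (t=1).
    `velocity_fn(state, action, t)` is a hypothetical learned field."""
    a = rng.standard_normal(action_dim)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        a = a + dt * velocity_fn(state, a, t)
    return a

def reward_channel(disc_logit):
    """Assumed GAIL-style reward: -log(1 - sigmoid(logit)), which equals
    softplus(logit). Higher logit means 'more expert-like'."""
    return np.logaddexp(0.0, disc_logit)

def action_channel_loss(student_actions, teacher_actions):
    """Assumed dense local target: mean squared error between student
    actions and teacher actions at student-visited states."""
    diff = student_actions - teacher_actions
    return np.mean(np.sum(diff ** 2, axis=-1))

def dual_objective(policy_loss, action_loss, beta=1.0):
    """Assumed coupling: RL loss on the learned reward plus a weighted
    action-distillation term. `beta` is a hypothetical trade-off weight."""
    return policy_loss + beta * action_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.array([0.5, -0.2])
    # Toy velocity field for a straight-line path to a single target action.
    toy_field = lambda s, a, t: (target - a) / (1.0 - t)
    print(fm_teacher_action(toy_field, None, 2, rng))
    print(reward_channel(0.0), dual_objective(1.0, 2.0, beta=0.5))
```

In this sketch the reward term would be fed to any standard policy-gradient learner over student rollouts, while the action term is an ordinary regression loss; how FA-OPD actually weights or schedules the two is not stated in the abstract.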
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
behavioral cloning
flow-based policy
embodied control
demonstration learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
flow matching
behavioral cloning
adversarial learning
embodied control
Zhenglin Wan
School of Computing, National University of Singapore
Jingxuan Wu
Department of Statistics and Operations Research, UNC-Chapel Hill, America
Xingrui Yu
Scientist, CFAR, A*STAR
Machine Learning, Robust Imitation Learning, Trustworthy AI
Chubin Zhang
Tsinghua University
Embodied AI, 3D Vision
Mingcong Lei
Chinese University of Hong Kong, Shenzhen
AI, Agent, Embodied, Deep Learning
Bo An
Nanyang Technological University
Artificial intelligence, multi-agent systems, game theory, reinforcement learning, optimization
Ivor W. Tsang
CFAR, Agency for Science, Technology and Research, Singapore; IHPC, Agency for Science, Technology and Research, Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore
Yang You
Postdoc, Stanford University
3D vision, computer graphics, computational geometry