OVD: On-policy Verbal Distillation

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing policy distillation methods rely on token-level alignment between teacher and student models, which constrains exploration, struggles to incorporate environmental feedback, and incurs substantial memory overhead. This work proposes a memory-efficient, trajectory-level policy distillation framework that abandons token-by-token matching in favor of aligning complete trajectories using discrete language-based scores (ranging from 0 to 9) generated by the teacher model. This approach enables the student to freely explore the output space, effectively integrates interactive feedback, and substantially reduces memory consumption. Evaluated on web question answering and mathematical reasoning tasks, the method achieves gains of up to 12.9% and 25.7% in exact match accuracy, respectively, using only single-example training, while also demonstrating superior training efficiency compared to existing approaches.

📝 Abstract
Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0--9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web Q&A tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io
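The trajectory-level loop the abstract describes can be sketched in a few lines: the student samples complete trajectories on-policy, the teacher returns a discrete verbal score in 0--9 for each, and the centered scores serve as sequence-level rewards, so no token-level alignment between the two vocabularies is needed. All names, the stub teacher, and the mean-centering step below are illustrative assumptions, not the paper's actual implementation.

```python
def teacher_verbal_score(prompt, trajectory):
    """Stub for the teacher model. In OVD the teacher emits a discrete
    verbal score in {0, ..., 9} for a complete student trajectory; here
    we fake it on a toy task that rewards mentioning the right answer."""
    return 9 if "42" in trajectory else 2

def ovd_step(prompt, sample_trajectory, teacher_score_fn, n_samples=4):
    """One on-policy step (illustrative sketch):
    1. sample n complete trajectories from the current student policy;
    2. ask the teacher for a verbal score (0-9) per trajectory;
    3. center the scores into sequence-level advantages that a
       policy-gradient update could consume.
    Because matching happens at the trajectory level, the student and
    teacher never need aligned tokenizations."""
    trajectories = [sample_trajectory(prompt) for _ in range(n_samples)]
    scores = [teacher_score_fn(prompt, t) for t in trajectories]
    mean = sum(scores) / len(scores)
    # Centered scores act as advantages; they sum to zero by construction.
    advantages = [s - mean for s in scores]
    return list(zip(trajectories, advantages))
```

In a real setup the returned (trajectory, advantage) pairs would feed a REINFORCE-style update on the student's log-probabilities; the stub above only shows the scoring side of the loop.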
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
token-level alignment
memory bottleneck
reinforcement learning
knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy Distillation
Verbal Feedback
Trajectory Matching
Memory Efficiency
Knowledge Distillation
Jing Xiong
The University of Hong Kong
Natural Language Processing; Automated Theorem Proving
Hui Shen
The University of Hong Kong, Hong Kong, China
Shansan Gong
HKU; SJTU
NLP; ML
Yuxin Cheng
The University of Hong Kong, Hong Kong, China
Jianghan Shen
Nanjing University, Nanjing, China
Chaofan Tao
Huawei Technologies, China
Haochen Tan
City University of Hong Kong
NLP; Deep Learning
Haoli Bai
Huawei Technologies
Natural Language Processing; Model Compression
Lifeng Shang
Huawei Noah's Ark Lab
Machine Learning; Computer Vision; Pattern Recognition; Natural Language Processing
Ngai Wong
The University of Hong Kong, Hong Kong, China