SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

📅 2026-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses token-level credit assignment under sparse, outcome-level rewards in on-policy reinforcement learning for large language models. The authors propose SCOPE, a dual-path adaptive training framework that routes on-policy rollouts by correctness: incorrect trajectories receive teacher-perplexity-weighted KL distillation, and correct trajectories receive student-perplexity-weighted maximum likelihood estimation (MLE). Both paths use group-level normalization to adaptively calibrate the weight distributions, accounting for difficulty variance across prompts. The weighting prioritizes incorrect rollouts where the teacher's guidance is reliable and concentrates reinforcement on low-confidence correct responses. Across six reasoning benchmarks, the method achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines.
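
The reported metrics can be made concrete with a small sketch. This page does not define Avg@32 or Pass@32, so the code below assumes the common reading: Avg@k is the mean accuracy over k sampled answers per problem, and Pass@k is 1 if at least one of the k samples is correct. The function names and the boolean-flag input format are hypothetical, chosen only for illustration.

```python
def avg_at_k(correct_flags):
    """Assumed Avg@k: fraction of the k sampled answers that are correct."""
    return sum(correct_flags) / len(correct_flags)


def pass_at_k(correct_flags):
    """Assumed Pass@k: 1.0 if any of the k sampled answers is correct."""
    return 1.0 if any(correct_flags) else 0.0


def benchmark_scores(per_problem_flags):
    """Average both metrics over all problems in one benchmark."""
    n = len(per_problem_flags)
    avg = sum(avg_at_k(flags) for flags in per_problem_flags) / n
    passed = sum(pass_at_k(flags) for flags in per_problem_flags) / n
    return avg, passed


def relative_improvement(new_score, baseline_score):
    """Relative gain over a baseline, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score
```

Under this reading, a relative improvement of 11.42% means the score is 1.1142 times the baseline score, not 11.42 points higher.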

📝 Abstract
On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
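
The routing and weighting described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the functional form of the weights, so the code assumes that (a) a "group" is the set of rollouts sampled for one prompt, (b) on the incorrect path a lower teacher perplexity means more reliable guidance and therefore a larger weight, (c) on the correct path a higher student perplexity means lower confidence and therefore a larger weight, and (d) group-level normalization is a softmax over the group rescaled to average 1. The names `scope_weights`, `group_normalize`, `temperature`, and the rollout dictionary keys are hypothetical. The sketch only computes the path and weight per rollout; the KL and MLE loss terms themselves are left out.

```python
import math


def perplexity(token_logprobs):
    """Sequence perplexity from per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def group_normalize(scores, temperature=1.0):
    """Assumed normalization: softmax over the group, rescaled to mean 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def scope_weights(rollouts, temperature=1.0):
    """Return one (path, weight) pair per rollout for a single prompt.

    Each rollout is a dict with hypothetical keys:
      "correct": bool
      "student_logprobs": per-token log-probs under the student
      "teacher_logprobs": per-token log-probs under the teacher
    """
    wrong = [i for i, r in enumerate(rollouts) if not r["correct"]]
    right = [i for i, r in enumerate(rollouts) if r["correct"]]
    result = [None] * len(rollouts)

    if wrong:
        # Assumption: lower teacher perplexity -> larger KL distillation weight.
        scores = [-math.log(perplexity(rollouts[i]["teacher_logprobs"]))
                  for i in wrong]
        for i, w in zip(wrong, group_normalize(scores, temperature)):
            result[i] = ("kl", w)

    if right:
        # Assumption: higher student perplexity -> larger MLE weight.
        scores = [math.log(perplexity(rollouts[i]["student_logprobs"]))
                  for i in right]
        for i, w in zip(right, group_normalize(scores, temperature)):
            result[i] = ("mle", w)

    return result
```

In a training loop, each rollout's weight would multiply its path-specific loss (token-level KL to the teacher for "kl", negative log-likelihood for "mle") before averaging over the batch. Normalizing within each group keeps the total weight per prompt fixed, so hard and easy prompts contribute comparably.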
Problem

Research questions and friction points this paper is trying to address.

on-policy reinforcement learning
token-level credit assignment
on-policy distillation
signal quality
reasoning alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Distillation
Adaptive Weighting
Dual-Path Framework
Token-Level Credit Assignment
Signal Calibration
Binbin Zheng
Associate Professor, The Uniformed Services University of Health Sciences
Teaching and learning in health professions education, Technology-supported learning
Xing Ma
Meituan, NLP engineer
Dialog System, Large Language Model, Conversation Analysis
Yiheng Liang
Nanjing University, Meituan, Beijing, China
Jingqing Ruan
Meituan, Beijing, China
Xiaoliang Fu
Fudan University
Kepeng Lin
Huazhong University of Science and Technology
Benchang Zhu
Meituan, Beijing, China
Ke Zeng
Meituan, Beijing, China
Xunliang Cai
Meituan, Beijing, China