Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the trade-off between task performance and teacher behavioral consistency in large language model knowledge distillation, formalizing it as a constrained reinforcement learning problem. We propose an end-to-end optimization framework that requires no runtime access to the teacher model and avoids state-space expansion. Grounded in constrained Markov decision processes, our method introduces a modified reward function that jointly incorporates task rewards and KL-divergence constraints on output distributions, with theoretical guarantees for constraint satisfaction. Compared to soft Lagrangian baselines, our approach achieves significant improvements on mathematical reasoning tasks: +12.3% constraint satisfaction rate and +4.7% reasoning accuracy, while maintaining state-of-the-art task performance—without incurring the high computational overhead typical of conventional dual optimization methods.

Technology Category

Application Category

📝 Abstract
We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun exploring the integration of task-specific rewards into distillation processes, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state augmented reinforcement learning to the distillation setting, introducing a modified reward function that maintains theoretical guarantees of constraint satisfaction without requiring state augmentation or teacher model access during deployment and without the computational overhead of the dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves better constraint satisfaction rates and better reasoning compared to the soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings.
Problem

Research questions and friction points this paper is trying to address.

Formulating LLM distillation as constrained reinforcement learning problem
Maximizing task rewards while limiting divergence from teacher model
Providing efficient distillation solution for resource-constrained settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates LLM distillation as constrained reinforcement learning problem
Maximizes task rewards while limiting teacher divergence threshold
Uses modified reward function without teacher access during deployment
🔎 Similar Papers
No similar papers found.
Matthieu Zimmer
Matthieu Zimmer
RL Research Scientist @ Huawei Noah’s Ark Lab
artificial intelligence : learningdevelopmental learningreinforcement learningneural networks
X
Xiaotong Ji
Huawei Noah’s Ark Lab
T
Tu Nguyen
Huawei R&D Munich
H
Haitham Bou Ammar
Huawei Noah’s Ark Lab, UCL Centre for Artificial Intelligence