🤖 AI Summary
This work addresses a challenge in distilling reasoning capabilities from large language models into smaller student models: the teacher often generates structurally inconsistent reasoning paths for semantically similar questions, which hinders effective learning. To resolve this, the authors propose Distillation through Reasoning Path Compression (D-RPC), which constructs and dynamically maintains a compact bank of reusable high-level reasoning paths. During training, the most relevant stored path is retrieved for each question and the teacher is conditioned to follow it, yielding rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes generalization bound analysis formalizes the trade-off between bank size and coverage and identifies an optimal intermediate bank size. Experiments on five mathematical and commonsense reasoning benchmarks show that two different student models trained with D-RPC consistently outperform existing baselines while using fewer generated tokens than template-heavy methods.
📝 Abstract
When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.
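The retrieve-then-condition step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify how relevance is scored, how paths are stored, or how the teacher prompt is worded. The bank contents, the bag-of-words cosine similarity, and the names `PATH_BANK`, `retrieve_path`, and `build_teacher_prompt` are hypothetical and are not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical bank of high-level reasoning paths (illustrative contents only).
PATH_BANK = {
    "rate_problem": "Identify the rates, set up distance = rate x time, solve for the unknown.",
    "percent_change": "Find the original value, compute the difference, divide by the original.",
    "cause_effect": "Identify the event, list plausible causes, pick the most common sense cause.",
}


def bag_of_words(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_path(question, bank):
    """Return the name of the stored path most similar to the question.

    Assumption: relevance is plain lexical overlap; the paper may use a
    learned or embedding-based retriever instead.
    """
    q = bag_of_words(question)
    return max(bank, key=lambda name: cosine(q, bag_of_words(bank[name])))


def build_teacher_prompt(question, bank):
    """Condition the teacher on the retrieved path (hypothetical prompt wording)."""
    name = retrieve_path(question, bank)
    return (
        f"Follow this reasoning path: {bank[name]}\n"
        f"Question: {question}\n"
        "Rationale:"
    )


if __name__ == "__main__":
    q = "A train travels at a rate of 60 km per hour; what distance does it cover in a time of 2 hours?"
    print(retrieve_path(q, PATH_BANK))
    print(build_teacher_prompt(q, PATH_BANK))
```

In this sketch, every question that maps to the same stored path receives the same structural guidance, which is the consistency property the abstract emphasizes. The number of entries in the bank plays the role of the size-versus-coverage knob that the PAC-Bayes analysis studies.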