Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the severe degradation of multi-step reasoning capabilities, termed “reasoning collapse”, that is commonly observed during efficient distillation of large language models. The study traces the root cause to effective rank (eRank) collapse in hidden representations: width compression through poorly structured projection matrices renders tokens indistinguishable. To mitigate this, the authors propose RED (Reasoning-preserved Efficient Distillation), which uses activation-aware initialization to configure projection matrices as channel-selection operators, thereby preserving eRank. Theoretical analysis and extensive experiments establish, for the first time, a direct link between eRank collapse and the decline in reasoning performance. Evaluated on the Llama and Qwen model families, RED substantially restores multi-step reasoning ability while maintaining state-of-the-art general performance and high training efficiency.
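To make the notion of eRank concrete, here is a minimal sketch that computes an effective rank for a matrix of hidden states. It assumes the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values); the summary does not state which definition the paper uses, and the function name `effective_rank` and the toy matrices are illustrative only.

```python
import numpy as np

def effective_rank(hidden, eps=1e-12):
    """Entropy-based effective rank of a (tokens x channels) matrix.

    Assumption: eRank = exp(H(p)), where p is the singular-value
    distribution normalized to sum to 1. This is one common definition,
    not necessarily the exact one used in the paper.
    """
    s = np.linalg.svd(hidden, compute_uv=False)  # singular values only
    p = s / (s.sum() + eps)                      # normalize to a distribution
    p = p[p > eps]                               # drop zeros to avoid log(0)
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)

# Toy "healthy" hidden states: energy spread over many directions.
full = rng.standard_normal((256, 64))

# Toy "collapsed" hidden states: one dominant direction plus small noise.
collapsed = (
    np.outer(rng.standard_normal(256), rng.standard_normal(64))
    + 0.01 * rng.standard_normal((256, 64))
)

print("eRank (spread):   ", effective_rank(full))
print("eRank (collapsed):", effective_rank(collapsed))
```

Under this definition, a matrix whose singular values are all equal reaches the maximum eRank (its full rank), while a matrix dominated by a single direction has an eRank close to 1, which is the regime in which rows (tokens) become hard to tell apart.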

📝 Abstract

Efficient Distillation (EDistill) compresses large language models (LLMs) by structurally pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state-of-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how the singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which uses activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on the Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.
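The idea of initializing a width-reducing projection as a channel-selection matrix can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the abstract does not say how channels are scored, so the per-channel L2 norm over calibration activations used here is a hypothetical criterion, and `channel_selection_init` is a hypothetical helper name.

```python
import numpy as np

def channel_selection_init(activations, d_small):
    """Build a (d_large x d_small) projection with one-hot columns.

    Assumption: channels are ranked by the L2 norm of their activations
    on calibration data; the paper's actual criterion may differ.
    """
    d_large = activations.shape[1]
    scores = np.linalg.norm(activations, axis=0)    # one score per channel
    keep = np.sort(np.argsort(scores)[-d_small:])   # top-k, original order
    proj = np.zeros((d_large, d_small))
    proj[keep, np.arange(d_small)] = 1.0            # column j selects channel keep[j]
    return proj, keep

rng = np.random.default_rng(0)

# Toy calibration activations: 128 tokens, 16 channels of increasing scale.
acts = rng.standard_normal((128, 16)) * np.linspace(0.1, 2.0, 16)

proj, keep = channel_selection_init(acts, d_small=4)
reduced = acts @ proj  # identical to acts[:, keep]

# Singular values of the selection matrix versus a random Gaussian one.
print("selection:", np.linalg.svd(proj, compute_uv=False))
print("random:   ", np.linalg.svd(rng.standard_normal((16, 4)), compute_uv=False))
```

Because each column of the selection matrix is a distinct one-hot vector, its columns are orthonormal and all of its singular values equal 1, so the projection copies the kept channels without rescaling any direction. A randomly initialized matrix of the same shape generally has unequal singular values, which is the uneven distribution the abstract associates with eRank collapse.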

Problem

Research questions and friction points this paper is trying to address.

reasoning collapse

efficient distillation

eRank collapse

large language models

multi-step reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning collapse

effective rank (eRank)

activation-aware initialization