🤖 AI Summary
Existing methods struggle to distill quadratic-attention large language models into sub-quadratic linear architectures such as xLSTM without degrading downstream performance. This work proposes an efficient distillation framework tailored to xLSTM: a multi-expert linearized distillation phase is followed by an expert-merging stage that consolidates the student experts into a single high-performance model. The authors introduce a lossless-distillation evaluation criterion based on tolerance-corrected win and tie rates, and achieve, for the first time, high-fidelity knowledge transfer from teacher models—including Llama, Qwen, and Olmo—to xLSTM. Experiments demonstrate that the proposed method not only recovers but often surpasses the teachers' performance across diverse downstream tasks, significantly advancing the development of efficient alternatives to the Transformer architecture.
📝 Abstract
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher over sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, in which individually linearized experts are combined into a single model. We demonstrate the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
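To make the "tolerance-corrected Win-and-Tie rates" criterion concrete, here is a minimal sketch of one plausible reading: the student "wins" on a task if its score exceeds the teacher's by more than a tolerance, and "ties" if the two scores fall within that tolerance. The function name, signature, and tolerance value are illustrative assumptions, not the paper's exact definition.

```python
def win_and_tie_rates(student_scores, teacher_scores, tol=0.02):
    """Hypothetical tolerance-corrected win/tie rates over paired task scores.

    A task counts as a "win" if the student beats the teacher by more than
    `tol`, and as a "tie" if the scores differ by at most `tol`. The exact
    definition in the paper may differ; this is an illustrative sketch.
    """
    n = len(student_scores)
    wins = sum(s > t + tol for s, t in zip(student_scores, teacher_scores))
    ties = sum(abs(s - t) <= tol for s, t in zip(student_scores, teacher_scores))
    return wins / n, ties / n


# Example: student beats the teacher on one task, ties on one, loses on one.
student = [0.80, 0.75, 0.90]
teacher = [0.70, 0.75, 0.95]
win_rate, tie_rate = win_and_tie_rates(student, teacher)
```

Under such a criterion, distillation would be called lossless when the combined win-and-tie rate reaches 100%, i.e., the student never falls behind the teacher by more than the tolerance on any task.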