Effective Distillation to Hybrid xLSTM Architectures

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to distill large language models based on quadratic attention into sub-quadratic linear architectures such as xLSTM without degrading downstream performance. This work proposes an efficient distillation framework tailored to xLSTM: a multi-expert linearized distillation phase is followed by an expert-merging stage that consolidates the student experts into a single high-performance model. The authors evaluate against a lossless-distillation criterion based on tolerance-corrected win and tie rates, and demonstrate high-fidelity knowledge transfer from teacher models in the Llama, Qwen, and Olmo families to xLSTM students. Experiments show that the distilled students recover most of the teachers' performance across diverse downstream tasks, and even surpass it on some, advancing the development of efficient alternatives to the Transformer architecture.

📝 Abstract
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
Problem

Research questions and friction points this paper is trying to address.

distillation
large language models
xLSTM
linearized architectures
lossless distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

distillation
xLSTM
lossless distillation
linearized architectures
model merging