Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference complexity and deployment challenges of Transformer models by systematically investigating knowledge distillation from large Transformer teacher models into nine subquadratic-complexity student architectures, including state space models (SSMs), linear attention variants, and recurrent models. We propose intelligent initialization strategies, notably QKV replication and matrix mixing, to bridge architectural disparities. To our knowledge, this is the first large-scale empirical study across multiple NLP benchmarks comparing how effectively diverse subquadratic architectures retain Transformer knowledge. Our results reveal that both architectural constraints and initialization critically influence knowledge transfer efficacy: we identify which student architecture aligns most closely with the teacher and demonstrate that intelligent initialization substantially accelerates convergence and improves final accuracy. The study establishes a reproducible distillation framework and provides empirical evidence to guide the development of efficient, lightweight language models.
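The distillation objective described above (a student trained to mimic a teacher's output distribution) is commonly implemented as a temperature-scaled KL divergence over logits. A minimal numpy sketch of that standard soft-label loss follows; the function name, temperature value, and exact formulation are illustrative assumptions, not the paper's verified training recipe.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across T
    # (the classic Hinton-style soft-label objective; a sketch only).
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T ** 2)
```

When student and teacher logits coincide the loss is zero, and it grows as the student's distribution drifts from the teacher's, which is what drives the knowledge transfer.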

📝 Abstract
Knowledge distillation is a widely used technique for compressing large language models (LLMs) by training a smaller student model to mimic a larger teacher model. Typically, both the teacher and student are Transformer-based architectures, leveraging softmax attention for sequence modeling. However, the quadratic complexity of self-attention at inference time remains a significant bottleneck, motivating the exploration of subquadratic alternatives such as structured state-space models (SSMs), linear attention, and recurrent architectures. In this work, we systematically evaluate the transferability of knowledge distillation from a Transformer teacher to nine subquadratic student architectures. Our study aims to determine which subquadratic model best aligns with the teacher's learned representations and how different architectural constraints influence the distillation process. We also investigate the impact of intelligent initialization strategies, including matrix mixing and query-key-value (QKV) copying, on the adaptation process. Our empirical results on multiple NLP benchmarks provide insights into the trade-offs between efficiency and performance, highlighting key factors for successful knowledge transfer to subquadratic architectures.
Problem

Research questions and friction points this paper is trying to address.

Evaluating knowledge distillation from Transformers to subquadratic models
Identifying best subquadratic architecture for teacher-student alignment
Assessing impact of initialization strategies on distillation success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation from Transformers to subquadratic models
Evaluates nine subquadratic student architectures
Investigates initialization strategies like QKV copying
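The QKV copying strategy listed above can be pictured as seeding the student's projection matrices with the teacher attention block's learned query/key/value weights, leaving only the student-specific parameters randomly initialized. A minimal sketch under assumed shapes and names (`teacher_qkv`, `mixer` are illustrative, not the paper's API):

```python
import numpy as np

def init_student_from_teacher(teacher_qkv, d_model, rng=None):
    # QKV replication: copy the teacher's q/k/v projection weights into
    # the student's corresponding projections, so the subquadratic
    # token-mixing layer starts from learned feature maps instead of a
    # random draw. `teacher_qkv` maps 'q'/'k'/'v' to (d_model, d_model)
    # weight matrices; this is a sketch of the idea, not the exact code.
    student = {}
    for name in ("q", "k", "v"):
        W = teacher_qkv[name]
        assert W.shape == (d_model, d_model)
        student[name] = W.copy()
    # Parameters with no teacher counterpart (e.g. SSM/recurrent state
    # weights) are initialized randomly, as in standard training.
    rng = np.random.default_rng(0) if rng is None else rng
    student["mixer"] = rng.normal(0.0, 0.02, size=(d_model, d_model))
    return student
```

The design choice this illustrates: only the parameters with a direct architectural counterpart are transferred, which is why the paper's initialization strategies must bridge the remaining architectural disparities.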