On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the efficiency–accuracy trade-off in long-context LLM inference, this paper proposes a dynamic online distillation framework. The method replaces native Transformer layers with novel Dual-State Linear Attention (DSLA) layers during inference, using a dual-state hidden representation to jointly model global dependencies and local sensitivity. Combined with a sensitivity-driven layer replacement strategy and a chained incremental fine-tuning mechanism, the framework enables structural self-adaptation throughout inference. On multi-task benchmarks, the approach achieves 2.3× and 3.0× inference speedup over Llama2-7B and Zamba-7B, respectively, with no statistically significant performance degradation. Ablation studies confirm that DSLA effectively mitigates the short-range bias inherent in linear attention, substantially improving joint modeling of both long-range and short-range dependencies.

📝 Abstract
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy by overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that Serve yields 2.3× faster inference than Llama2-7B and 3.0× faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA's dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions. Code is available at https://github.com/utnslab/DSLA-Serve.
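The dual-state idea described in the abstract can be sketched as a linear-attention recurrence with two key-value states: one accumulated without decay (historical context) and one with exponential decay (recency). This is a minimal illustrative sketch, not the paper's actual parameterization; the decay factor `gamma` and the fixed mixing weight `alpha` are assumptions standing in for whatever (likely learned) gating DSLA uses.

```python
import numpy as np

def dsla_step(q, k, v, S_hist, S_rec, gamma=0.9, alpha=0.5):
    """One recurrent step of a dual-state linear attention (illustrative).

    S_hist accumulates all past key-value outer products without decay,
    preserving long-range history; S_rec applies exponential decay gamma,
    biasing it toward recent tokens. alpha mixes the two read-outs.
    """
    kv = np.outer(k, v)
    S_hist = S_hist + kv          # no decay: preserves historical context
    S_rec = gamma * S_rec + kv    # decayed: tracks recency
    out = alpha * (q @ S_hist) + (1 - alpha) * (q @ S_rec)
    return out, S_hist, S_rec

d = 4
rng = np.random.default_rng(0)
S_hist = np.zeros((d, d))
S_rec = np.zeros((d, d))
for _ in range(8):  # process a short token stream with O(1) state per step
    q, k, v = rng.standard_normal((3, d))
    out, S_hist, S_rec = dsla_step(q, k, v, S_hist, S_rec)
```

The point of the two states is that a single decayed state forgets distant tokens (the short-range bias the paper targets), while the undecayed state alone ignores position entirely; mixing them lets the layer serve both dependency ranges.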
Problem

Research questions and friction points this paper is trying to address.

Reduce compute and memory costs of LLM inference on lengthy inputs
Mitigate accuracy loss from linear attention's recency bias
Balance efficiency and accuracy via adaptive distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-state linear attention balances history and recency
Online adaptive distillation replaces Transformer layers dynamically
Chained fine-tuning ensures consistency in layer conversion
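The replacement loop behind the last two points can be illustrated on a toy linear model: rank layers by a sensitivity proxy, convert the least-sensitive layers first, and fit each replacement on activations flowing through the already-converted prefix (the "chained" part). The identity-bypass sensitivity proxy and least-squares distillation here are hypothetical stand-ins for the paper's sensitivity metric and DSLA fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4
layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def forward(ls, x):
    for W in ls:
        x = x @ W
    return x

probe = rng.standard_normal((32, d))   # calibration inputs
base = forward(layers, probe)          # original model's outputs

# 1) sensitivity proxy: output deviation when layer i is bypassed
def sensitivity(i):
    return np.linalg.norm(forward(layers[:i] + layers[i + 1:], probe) - base)

order = np.argsort([sensitivity(i) for i in range(n_layers)])  # least-sensitive first

# 2) chained conversion: each surrogate is fit on activations produced by
#    the already-converted prefix, keeping the partial model self-consistent
current = list(layers)
for i in order[:2]:                               # convert the two least-sensitive layers
    x_in = forward(current[:i], probe)            # inputs layer i sees *now*
    target = x_in @ layers[i]                     # teacher layer's output
    W_new, *_ = np.linalg.lstsq(x_in, target, rcond=None)  # distillation step
    current[i] = W_new
```

Fitting each new layer against the partially converted prefix, rather than the original model, is what keeps successive replacements consistent with one another, which is the role the chained fine-tuning plays in the paper's framework.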