🤖 AI Summary
To address the efficiency–accuracy trade-off in long-context LLM inference, this paper proposes a dynamic online distillation framework. Our method replaces native Transformer layers with novel Dual-State Linear Attention (DSLA) layers during inference, leveraging a first-of-its-kind dual-state hidden representation to jointly model global dependencies and local sensitivity. Integrated with a sensitivity-driven layer replacement strategy and a chain-style incremental fine-tuning mechanism, the framework enables structural self-adaptation throughout inference. On multi-task benchmarks, our approach achieves 2.3× and 3.0× inference speedup over Llama2-7B and Zamba-7B, respectively, with no statistically significant performance degradation. Ablation studies confirm that DSLA effectively mitigates the short-range bias inherent in linear attention, substantially improving joint modeling of both long-range and short-range dependencies.
📝 Abstract
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy by overemphasizing recent tokens. In this work, we first propose Dual-State Linear Attention (DSLA), a novel design that maintains two specialized hidden states: one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3× faster inference than Llama2-7B and 3.0× faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA's dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions. Code is available at https://github.com/utnslab/DSLA-Serve.
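To make the dual-state idea concrete, here is a minimal NumPy sketch of the recurrence a DSLA-style layer could use. It is not the paper's implementation: the two-accumulator structure (an undecayed "history" state plus an exponentially decayed "recency" state), the decay factor `gamma`, and the equal-weight combination are all illustrative assumptions layered on the standard linear-attention recurrence S_t = S_{t-1} + k_t v_t^T.

```python
import numpy as np

def dual_state_linear_attention(Q, K, V, gamma=0.9):
    """Illustrative sketch of dual-state linear attention (not the paper's code).

    Two running states are maintained over the sequence:
      - S_hist: undecayed accumulator, preserving all historical context
      - S_rec:  exponentially decayed accumulator, biased toward recent tokens
    A plain linear-attention layer keeps only one such state; the decayed
    variant forgets history, which is the short-range bias DSLA targets.
    """
    T, d = Q.shape
    S_hist = np.zeros((d, d))
    S_rec = np.zeros((d, d))
    out = np.zeros((T, d))
    for t in range(T):
        kv = np.outer(K[t], V[t])          # rank-1 key-value update
        S_hist = S_hist + kv               # keep all history equally
        S_rec = gamma * S_rec + kv         # decay old contributions
        # Combine the two states; a fixed 0.5/0.5 mix is an assumption here
        # (a learned gate per layer would be the natural choice in practice).
        out[t] = 0.5 * (S_hist + S_rec).T @ Q[t]
    return out
```

Because each state is a fixed d×d matrix updated once per token, the cost is O(T·d²) rather than the O(T²·d) of softmax attention, which is where the inference speedup comes from.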