STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two limitations of existing linearized large language models: token routing based on position-based sliding windows, which struggles to capture each token's global importance, and learnable feature mappings, which often distort pretrained representations and induce distributional shifts. To overcome these issues, the authors propose STILL, a framework that employs a self-saliency scoring mechanism to enable consistent local-to-global token selection. Within each sliding window, STILL applies sparse softmax attention to salient tokens and linear attention to the remaining context. It further introduces a norm-preserving feature mapping (NP-Map) to retain pretrained representations and integrates a unified training-inference architecture with a delayed selection strategy to enhance hardware efficiency. Experiments show that STILL matches or surpasses the original model on commonsense and general reasoning tasks and achieves up to an 86.2% relative improvement over prior linear methods on long-context benchmarks.
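The routing idea in the summary can be illustrated with a toy sketch: score each token's saliency within a window, send the top-scoring tokens through exact softmax attention, and summarize the rest with a linear-attention state. The saliency proxy, positive feature map, and fixed 50/50 mixing below are assumptions for illustration, not the paper's Self-Saliency Score or its actual fusion rule.

```python
import numpy as np

def hybrid_window_attention(q, k, v, top_k=4):
    """Toy intra-layer hybrid attention over one window (illustrative sketch).

    - score each key token's saliency (here: mean |logit| over queries, an
      assumed proxy for the paper's Self-Saliency Score),
    - run exact softmax attention over the top_k salient tokens,
    - summarize the remaining tokens with a simple linear-attention state.
    """
    T, d = k.shape
    logits = q @ k.T / np.sqrt(d)              # (T, T) raw attention logits
    saliency = np.abs(logits).mean(axis=0)     # assumed per-token saliency proxy
    sel = np.argsort(-saliency)[:top_k]        # salient tokens -> softmax path
    rest = np.setdiff1d(np.arange(T), sel)     # remaining tokens -> linear path

    # sparse softmax attention restricted to the salient tokens
    sub = logits[:, sel]
    w = np.exp(sub - sub.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    softmax_out = w @ v[sel]

    # linear attention over the remaining context:
    # out_i = phi(q_i) S / (phi(q_i) z), with S = sum phi(k_j)^T v_j
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # assumed positive feature map
    S = phi(k[rest]).T @ v[rest]               # (d, d) summary state
    z = phi(k[rest]).sum(axis=0)               # (d,) normalizer
    linear_out = (phi(q) @ S) / (phi(q) @ z)[:, None]

    # fixed equal mixing of the two paths (an assumption for this sketch)
    return 0.5 * softmax_out + 0.5 * linear_out
```

The point of the sketch is the routing structure: only `top_k` tokens ever enter the quadratic softmax path, while the rest contribute through a constant-size `(d, d)` state.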

📝 Abstract
Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms to alleviate the quadratic complexity of standard softmax attention. Existing methods perform token routing based on sliding-window partitions, which results in position-based selection and fails to capture token-specific global importance. Meanwhile, linear attention further suffers from distribution shift caused by learnable feature maps that distort pretrained feature magnitudes. Motivated by these limitations, we propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs. STILL introduces a Self-Saliency Score with strong local-global consistency, enabling accurate token selection using sliding-window computation, and retains salient tokens for sparse softmax attention while summarizing the remaining context via linear attention. To preserve pretrained representations, we design a Norm-Preserved Feature Map (NP-Map) that decouples feature direction from magnitude and reinjects pretrained norms. We further adopt a unified training-inference architecture with chunk-wise parallelization and delayed selection to improve hardware efficiency. Experiments show that STILL matches or surpasses the original pretrained model on commonsense and general reasoning tasks, and achieves up to an 86.2% relative improvement over prior linearized attention methods on long-context benchmarks.
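The NP-Map idea from the abstract — "decouples feature direction from magnitude and reinjects pretrained norms" — can be sketched minimally: apply the nonlinear feature map only to the unit direction of the input, then rescale the result so its norm matches the original pretrained norm. The specific `feature_fn` below is an assumed placeholder, not the paper's learned map.

```python
import numpy as np

def np_map(x, feature_fn=lambda d: np.maximum(d, 0.0) + 1e-6, eps=1e-6):
    """Norm-preserving feature map (illustrative sketch of the NP-Map idea).

    1. Split x into direction (unit vector) and magnitude (norm).
    2. Apply the feature map to the direction only.
    3. Rescale so the output norm equals the pretrained input norm.
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)       # pretrained magnitude
    direction = x / (norm + eps)                           # unit direction
    mapped = feature_fn(direction)                         # map direction only
    mapped_norm = np.linalg.norm(mapped, axis=-1, keepdims=True)
    return mapped * (norm / (mapped_norm + eps))           # reinject the norm
```

Under this construction the output feature keeps the input's magnitude exactly (up to `eps`), which is the stated motivation: avoiding the distribution shift caused by feature maps that distort pretrained feature magnitudes.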
Problem

Research questions and friction points this paper is trying to address.

linear attention
token selection
distribution shift
large language models
hybrid attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

linear attention
token selection
self-saliency score
norm-preserved feature map
hybrid attention
Weikang Meng
SMULL Group, Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory
Liangyu Huo
SMULL Group, Harbin Institute of Technology, Shenzhen
Yadan Luo
ARC DECRA and Senior Lecturer, University of Queensland
Generalization / 3D Vision / Autonomous Driving
Jiawen Guan
SMULL Group, Harbin Institute of Technology, Shenzhen
Jingyi Zhang
Huawei
LLM / AI infra / deep learning
Yingjian Li
Pengcheng Laboratory
Zheng Zhang
HIT, SLAI
Multimodal Learning / Efficient Deep Learning / AI Security