STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two limitations of existing linearized large language models: token routing based on position-based sliding windows, which struggles to capture each token's global importance, and learnable feature mappings, which often distort pretrained representations and induce distributional shifts. To overcome these issues, the authors propose STILL, a framework that employs a self-saliency scoring mechanism to enable consistent local-to-global token selection. Within each sliding window, STILL applies sparse softmax attention to salient tokens and linear attention to the remaining context. It further introduces a norm-preserving feature mapping (NP-Map) to retain pretrained representations and integrates a unified training-inference architecture with a delayed selection strategy to enhance hardware efficiency. Experiments show that STILL matches or surpasses the original model on commonsense and general reasoning tasks and achieves up to an 86.2% relative improvement over prior linear methods on long-context benchmarks.
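The routing idea in the summary can be illustrated with a toy sketch: score each token's saliency within a window, send the top-scoring tokens through exact softmax attention, and summarize the rest with a linear-attention state. The saliency proxy, positive feature map, and fixed 50/50 mixing below are assumptions for illustration, not the paper's Self-Saliency Score or its actual fusion rule.

```python
import numpy as np

def hybrid_window_attention(q, k, v, top_k=4):
    """Toy intra-layer hybrid attention over one window (illustrative sketch).

    - score each key token's saliency (here: mean |logit| over queries, an
      assumed proxy for the paper's Self-Saliency Score),
    - run exact softmax attention over the top_k salient tokens,
    - summarize the remaining tokens with a simple linear-attention state.
    """
    T, d = k.shape
    logits = q @ k.T / np.sqrt(d)              # (T, T) raw attention logits
    saliency = np.abs(logits).mean(axis=0)     # assumed per-token saliency proxy
    sel = np.argsort(-saliency)[:top_k]        # salient tokens -> softmax path
    rest = np.setdiff1d(np.arange(T), sel)     # remaining tokens -> linear path

    # sparse softmax attention restricted to the salient tokens
    sub = logits[:, sel]
    w = np.exp(sub - sub.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    softmax_out = w @ v[sel]

    # linear attention over the remaining context:
    # out_i = phi(q_i) S / (phi(q_i) z), with S = sum phi(k_j)^T v_j
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # assumed positive feature map
    S = phi(k[rest]).T @ v[rest]               # (d, d) summary state
    z = phi(k[rest]).sum(axis=0)               # (d,) normalizer
    linear_out = (phi(q) @ S) / (phi(q) @ z)[:, None]

    # fixed equal mixing of the two paths (an assumption for this sketch)
    return 0.5 * softmax_out + 0.5 * linear_out
```

The point of the sketch is the routing structure: only `top_k` tokens ever enter the quadratic softmax path, while the rest contribute through a constant-size `(d, d)` state.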

📝 Abstract
Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms to alleviate the quadratic complexity of standard softmax attention. Existing methods perform token routing based on sliding-window partitions, which results in position-based selection and fails to capture token-specific global importance. Meanwhile, linear attention further suffers from distribution shift caused by learnable feature maps that distort pretrained feature magnitudes. Motivated by these limitations, we propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs. STILL introduces a Self-Saliency Score with strong local-global consistency, enabling accurate token selection using sliding-window computation, and retains salient tokens for sparse softmax attention while summarizing the remaining context via linear attention. To preserve pretrained representations, we design a Norm-Preserved Feature Map (NP-Map) that decouples feature direction from magnitude and reinjects pretrained norms. We further adopt a unified training-inference architecture with chunk-wise parallelization and delayed selection to improve hardware efficiency. Experiments show that STILL matches or surpasses the original pretrained model on commonsense and general reasoning tasks, and achieves up to an 86.2% relative improvement over prior linearized attention methods on long-context benchmarks.
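The NP-Map idea from the abstract — "decouples feature direction from magnitude and reinjects pretrained norms" — can be sketched minimally: apply the nonlinear feature map only to the unit direction of the input, then rescale the result so its norm matches the original pretrained norm. The specific `feature_fn` below is an assumed placeholder, not the paper's learned map.

```python
import numpy as np

def np_map(x, feature_fn=lambda d: np.maximum(d, 0.0) + 1e-6, eps=1e-6):
    """Norm-preserving feature map (illustrative sketch of the NP-Map idea).

    1. Split x into direction (unit vector) and magnitude (norm).
    2. Apply the feature map to the direction only.
    3. Rescale so the output norm equals the pretrained input norm.
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)       # pretrained magnitude
    direction = x / (norm + eps)                           # unit direction
    mapped = feature_fn(direction)                         # map direction only
    mapped_norm = np.linalg.norm(mapped, axis=-1, keepdims=True)
    return mapped * (norm / (mapped_norm + eps))           # reinject the norm
```

Under this construction the output feature keeps the input's magnitude exactly (up to `eps`), which is the stated motivation: avoiding the distribution shift caused by feature maps that distort pretrained feature magnitudes.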
Problem

Research questions and friction points this paper is trying to address.

linear attention
token selection
distribution shift
large language models
hybrid attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

linear attention
token selection
self-saliency score
norm-preserved feature map
hybrid attention
Weikang Meng
SMULL Group, Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory
Liangyu Huo
SMULL Group, Harbin Institute of Technology, Shenzhen
Yadan Luo
ARC DECRA and Senior Lecturer, University of Queensland
Generalization / 3D Vision / Autonomous Driving
Jiawen Guan
SMULL Group, Harbin Institute of Technology, Shenzhen
Jingyi Zhang
Huawei
LLM / AI infra / deep learning
Yingjian Li
Pengcheng Laboratory
Zheng Zhang
HIT, SLAI
Multimodal Learning / Efficient Deep Learning / AI Security