From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of self-attention in Transformers, noting that naively replacing attention layers with lightweight sequential modules often incurs significant performance degradation. To mitigate this, the authors propose a layer-wise distillation framework guided by attention sparsity: by analyzing the sparsity patterns across layers, they identify those amenable to lossless replacement and integrate an AViT-style token retention strategy to impose explicit sparsity. These sparse layers are then substituted with efficient sequential modules via a plug-in distillation approach. Experiments demonstrate that, under a fixed training budget, the method substantially reduces model parameters and inference latency while incurring only minimal accuracy loss. Moreover, higher sparsity in the teacher model correlates with a smaller performance gap between student and teacher, validating the efficacy of sparsity-guided layer replacement.
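The summary does not say how sparsity is measured, so the following is only a minimal sketch of one plausible way to rank layers by attention sparsity. It assumes sparsity is scored as one minus the normalized entropy of each attention row (0 for uniform attention, 1 for one-hot attention), averaged over rows. The function names, the metric, and the toy attention maps are illustrative assumptions and are not taken from the paper.

```python
import math


def row_sparsity(probs):
    """Assumed sparsity proxy: 1 - normalized entropy of one attention row.

    Returns 0.0 for a uniform row and 1.0 for a one-hot row.
    """
    n = len(probs)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(n)


def layer_sparsity(attn):
    """Mean row sparsity of one layer's attention matrix (a list of rows)."""
    return sum(row_sparsity(row) for row in attn) / len(attn)


def rank_layers_by_sparsity(attn_per_layer):
    """Return layer names ordered from sparsest to densest attention."""
    scores = {name: layer_sparsity(attn) for name, attn in attn_per_layer.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy attention maps (hypothetical values): one dense layer, one sparse layer.
toy_attention = {
    "layer_0": [[0.25, 0.25, 0.25, 0.25] for _ in range(4)],
    "layer_1": [[0.97, 0.01, 0.01, 0.01] for _ in range(4)],
}
replacement_order = rank_layers_by_sparsity(toy_attention)
```

Under this proxy, the layers at the front of `replacement_order` would be the first candidates for replacement. A real pipeline would average the score over heads and over a batch of inputs rather than a single map.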

📝 Abstract

Self-attention is the foundation of large-scale transformer pretraining, but its quadratic token interaction cost makes inference expensive. Replacing attention with simpler sequential modules is appealing, yet naive substitution is often lossy, especially at larger scales. This paper revisits attention replacement through the lens of sparsity. Having observed that sparsity patterns differ across transformer layers, we posit that pretrained transformers decompose complex token dependencies into sequence-to-sequence mappings of varying complexity, some of which can be approximated and replaced by much simpler sequential modules without loss. We evaluate this premise using a plug-and-play layer-wise distillation framework that approximates and replaces attention functionalities in pretrained vision transformer models. Controlled group-wise replacements under a fixed training budget reveal a clear pattern: substituting layers with sparser attention incurs substantially smaller accuracy drops than replacing denser ones. We further impose explicit attention sparsity on the pretrained ViT via AViT-style token retention and perform sparsity-guided distillation into the sequential replacement models, where increasing teacher sparsity consistently reduces the student-teacher gap. Guided by attention sparsity, the proposed method achieves efficient attention replacement with fewer parameters and lower latency.
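The abstract does not state the distillation objective. As a rough illustration only, a layer-wise distillation loss of the kind described could take the following form, assuming a mean-squared-error match between the teacher's attention output and the student's sequential module on the same layer input. All symbols here are assumptions introduced for this sketch: $\mathcal{S}$ is the set of replaced layers, $X_\ell$ the input to layer $\ell$, $A_\ell$ the frozen teacher attention block, and $g_\ell$ the sequential replacement with parameters $\theta_\ell$.

```latex
\mathcal{L}_{\text{distill}}
  = \sum_{\ell \in \mathcal{S}}
    \mathbb{E}_{X_\ell}
    \left[ \left\lVert A_\ell(X_\ell) - g_\ell(X_\ell; \theta_\ell) \right\rVert_2^2 \right]
```

Because each term depends only on one layer's input and output, every $g_\ell$ can in principle be trained independently of the others, which is one way to read the "plug-and-play" description.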

Problem

Research questions and friction points this paper is trying to address.

attention replacement

sparse attention

sequential modules

model efficiency

transformer

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Attention

Attention Distillation

Sequential Replacement