Linear Attention for Efficient Bidirectional Sequence Modeling

📅 2025-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge that linear attention mechanisms, while fast to train and cheap to run in the causal setting, have not been extended to bidirectional sequence modeling. To resolve this, the authors propose LION, a framework that establishes a theoretical foundation for bidirectional linear attention by constructing a bidirectional RNN exactly equivalent to full linear attention, preserving parallelizable training and linear-time inference. The key contributions are threefold: (1) an exact bidirectional RNN equivalence for full linear attention; (2) three variants cast into this bidirectional form: LION-LIT (the linear transformer of Katharopoulos et al., 2020), LION-D (extending RetNet), and LION-S (a linear transformer with a stable selective mask inspired by state space models, SSMs); and (3) empirical results showing performance on par with standard Transformers and SSMs on bidirectional tasks while significantly accelerating training. The implementation is publicly released.

📝 Abstract
Transformers with linear attention enable fast and parallel training. Moreover, they can be formulated as Recurrent Neural Networks (RNNs) for efficient linear-time inference. While extensively evaluated in causal sequence modeling, they have yet to be extended to the bidirectional setting. This work introduces the LION framework, establishing new theoretical foundations for linear transformers in bidirectional sequence modeling. LION constructs a bidirectional RNN equivalent to full Linear Attention. This extends the benefits of linear transformers (parallel training and efficient inference) to the bidirectional setting. Using LION, we cast three linear transformers to their bidirectional form: LION-LIT, the bidirectional variant corresponding to (Katharopoulos et al., 2020); LION-D, extending RetNet (Sun et al., 2023); and LION-S, a linear transformer with a stable selective mask inspired by the selectivity of SSMs (Dao & Gu, 2024). Replacing the attention block with LION (-LIT, -D, -S) achieves performance on bidirectional tasks that approaches that of Transformers and State-Space Models (SSMs), while delivering significant improvements in training speed. Our implementation is available at http://github.com/LIONS-EPFL/LION.
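The core idea in the abstract, that unmasked linear attention over a whole sequence can be computed as a forward plus a backward recurrence, can be illustrated with a small sketch. This is not the paper's exact LION formulation (which adds decay/selective masks); it is a minimal, assumed NumPy illustration of the equivalence between full (non-causal) linear attention and two RNN-style passes, subtracting the doubly counted diagonal term:

```python
import numpy as np

def phi(x):
    # ELU(x) + 1 feature map, as in Katharopoulos et al. (2020)
    return np.where(x > 0, x + 1.0, np.exp(x))

def bidirectional_linear_attention(Q, K, V):
    """Two-recurrence form of unmasked bidirectional linear attention.

    out_i = sum_j phi(q_i).phi(k_j) v_j / sum_j phi(q_i).phi(k_j),
    split into a forward pass (j <= i) and a backward pass (j >= i).
    Illustrative only; LION's masked variants differ.
    """
    L, d = Q.shape
    dv = V.shape[1]
    Qf, Kf = phi(Q), phi(K)

    # Forward recurrence: state accumulates phi(k_j) v_j^T for j <= i.
    S, z = np.zeros((d, dv)), np.zeros(d)
    fwd_num, fwd_den = np.zeros((L, dv)), np.zeros(L)
    for i in range(L):
        S += np.outer(Kf[i], V[i])
        z += Kf[i]
        fwd_num[i] = Qf[i] @ S
        fwd_den[i] = Qf[i] @ z

    # Backward recurrence: same state, accumulated for j >= i.
    S, z = np.zeros((d, dv)), np.zeros(d)
    bwd_num, bwd_den = np.zeros((L, dv)), np.zeros(L)
    for i in reversed(range(L)):
        S += np.outer(Kf[i], V[i])
        z += Kf[i]
        bwd_num[i] = Qf[i] @ S
        bwd_den[i] = Qf[i] @ z

    # Combine passes; the j = i term appears in both, so subtract it once.
    out = np.zeros((L, dv))
    for i in range(L):
        diag = Qf[i] @ Kf[i]
        num = fwd_num[i] + bwd_num[i] - diag * V[i]
        den = fwd_den[i] + bwd_den[i] - diag
        out[i] = num / den
    return out
```

Because each pass keeps only a d-by-dv state matrix, memory and time stay linear in sequence length, which is the efficiency benefit the abstract describes.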
Problem

Research questions and friction points this paper is trying to address.

How can linear transformers be extended to bidirectional sequence modeling?
Can parallel training and linear-time inference be preserved in the bidirectional setting?
What theoretical foundation connects bidirectional linear attention to RNNs?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exact bidirectional RNN equivalence for full linear attention
Three bidirectional variants: LION-LIT, LION-D (extending RetNet), and LION-S (selective mask inspired by SSMs)
Publicly released implementation with significant training-speed improvements at comparable accuracy
👥 Authors
Arshia Afzal, LIONS, EPFL, Switzerland; Integrated Neurotechnologies Laboratory, EPFL, Switzerland
Elias Abad Rocamora, EPFL. Interests: Deep Learning, Robustness Verification, Adversarial Robustness in NLP
Leyla Naz Candogan, LIONS, EPFL, Switzerland
Pol Puigdemont, EPFL. Interests: Machine Learning
Francesco Tonin, LIONS, EPFL, Switzerland
Yongtao Wu, EPFL. Interests: Trustworthy Machine Learning, Optimization
Mahsa Shoaran, Associate Professor, EPFL, Switzerland. Interests: Neural Interfacing, Biomedical Circuits, Machine Learning Hardware, Neuroengineering
V. Cevher, LIONS, EPFL, Switzerland