Forgetting Transformer: Softmax Attention with a Forget Gate

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Standard Transformers rely on fixed positional embeddings, which limits long-context modeling and generalization beyond training sequence lengths (length extrapolation), and can also hurt short-context downstream performance. Method: The paper proposes Forgetting Attention, a self-attention mechanism with a data-dependent forget gate that down-weights the unnormalized attention scores of less relevant key-value pairs. The mechanism requires no positional embeddings and remains natively compatible with FlashAttention, bringing the forget gate, a staple of recurrent sequence models, into the non-recurrent Transformer. The paper further introduces a "Pro" block design that incorporates common architectural components from recurrent sequence models. Contribution/Results: The resulting Forgetting Transformer (FoX) outperforms the standard Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, and retains the Transformer's advantage over recurrent models such as Mamba-2 on needle-in-a-haystack retrieval.

📝 Abstract
An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
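A minimal single-head sketch of the idea described in the abstract: each token emits a forget gate in (0, 1), and the unnormalized score between query i and key j is down-weighted by the product of the forget gates between them, applied in log space as an additive bias before the causal softmax. This is an illustrative NumPy reading of the mechanism, not the paper's reference implementation; the gate parameterization (e.g., a per-token sigmoid of a learned projection) and variable names here are assumptions.

```python
import numpy as np

def forgetting_attention(q, k, v, f):
    """Single-head causal Forgetting Attention (illustrative sketch).

    q, k: (T, d) queries and keys; v: (T, d_v) values.
    f: (T,) data-dependent forget gates in (0, 1), e.g. produced
       per token as sigmoid(x @ w_f + b_f) (assumed parameterization).

    The unnormalized score is down-weighted in log space:
        logit[i, j] = q_i . k_j / sqrt(d) + sum_{l=j+1..i} log f_l
    With all gates equal to 1 this reduces to standard causal attention.
    """
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)                 # (T, T) raw scores
    c = np.cumsum(np.log(f))                      # c_i = sum_{l<=i} log f_l
    decay = c[:, None] - c[None, :]               # (i, j) -> sum_{l=j+1..i} log f_l
    logits = logits + decay
    mask = np.tril(np.ones((T, T), dtype=bool))   # causal mask
    logits = np.where(mask, logits, -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because the decay enters as an additive bias on the logits, the softmax kernel itself is unchanged, which is what makes the mechanism compatible with FlashAttention-style implementations and removes the need for positional embeddings.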
Problem

Research questions and friction points this paper is trying to address.

How can the forget gate, an essential component of recurrent sequence models, be incorporated into the non-recurrent Transformer?
Standard Transformers struggle with long-context language modeling and length extrapolation beyond training sequence lengths.
Can the Transformer's long-context capabilities be retained without any positional embeddings?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Forgetting Attention: softmax attention with a data-dependent forget gate on unnormalized scores
Native compatibility with the FlashAttention algorithm, with no positional embeddings required
"Pro" block design that significantly improves both FoX and the standard Transformer