Adaptive Computation Pruning for the Forgetting Transformer

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In the Forgetting Transformer (FoX), many attention heads forget quickly: the forget gate rapidly decays attention to distant tokens, so much of the global softmax attention computation is redundant. Method: We propose Adaptive Computation Pruning (ACP) for FoX, which dynamically prunes computations involving input-output dependencies that the forget gate has strongly decayed. A dynamically set pruning threshold guarantees that the pruned attention weights remain negligible, making the acceleration effectively lossless. Results: During language model pretraining, ACP reduces softmax-attention FLOPs by roughly 70% across model sizes and context lengths, improves training throughput by about 10–35%, and yields greater savings at longer context lengths, all without any performance degradation.

📝 Abstract
The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on the local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. This is achieved using a dynamically set pruning threshold that ensures that the pruned attention weights remain negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 10% to 35% improvement in training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. We also perform several analyses to provide deeper insights into our method, such as examining the pruning patterns and analyzing the distribution of FLOP savings across different attention heads. Our code is available at https://github.com/zhixuan-lin/arctic-fox.
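To make the pruning criterion concrete, here is a minimal NumPy sketch of the idea described above: each key j seen from query i carries an accumulated log forget-gate decay, and entries whose decay pushes the attention weight below a negligible level are pruned. This is an illustrative reconstruction, not the paper's implementation (which prunes at the block level inside an attention kernel); the function name `acp_keep_mask` and the `qk_bound` and `eps` parameters are assumptions for the sketch.

```python
import numpy as np

def acp_keep_mask(log_f, qk_bound, eps=1e-8):
    """Sketch: boolean mask of (query, key) pairs that survive pruning.

    log_f:    (T,) log forget-gate values, each <= 0.
    qk_bound: assumed scalar bound on |q_i . k_j|.
    eps:      weights below this relative level are treated as negligible.
    """
    T = log_f.shape[0]
    # D[i, j] = sum_{t=j+1..i} log f_t: total decay applied to key j at query i.
    cum = np.cumsum(log_f)            # cum[i] = sum_{t<=i} log f_t
    D = cum[:, None] - cum[None, :]   # valid for j <= i (causal region)
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Prune where even the most favorable decayed logit (D + qk_bound) falls
    # more than log(eps) below the best possible surviving logit (-qk_bound).
    keep = causal & (D + 2 * qk_bound >= np.log(eps))
    return keep
```

With strongly forgetting heads (very negative `log_f`), the mask keeps only a narrow band near the diagonal, which is exactly the local-context pattern the abstract describes; weakly forgetting heads keep the full causal triangle, so no accuracy is lost.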
Problem

Research questions and friction points this paper is trying to address.

Many FoX attention heads forget quickly, so much of the global softmax attention computation contributes negligibly to the output
How to prune these strongly decayed computations dynamically without degrading model performance
How pruning patterns and FLOP savings are distributed across attention heads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic pruning of decayed attention dependencies
Dynamically set threshold guarantees pruned attention weights stay negligible
70% FLOP reduction in softmax attention