🤖 AI Summary
Softmax-based self-attention often assigns excessive weight to semantically irrelevant tokens (e.g., punctuation, special symbols), causing attention noise and rank collapse. To address this, we propose the Integral Transformer, which avoids negative attention scores and instead calibrates attention distributions by sampling from and integrating over the logit distribution, suppressing noise while preserving critical special tokens (e.g., [CLS], [SEP]). The method is applied hierarchically: lower layers retain standard Softmax attention to preserve representational fidelity, while upper layers adopt integral attention for stability and generalization. Evaluated on knowledge-intensive reasoning benchmarks, including CLUE, ReClor, and LogiQA, the Integral Transformer consistently outperforms the standard Transformer, Cog Attention, and the Differential Transformer. It mitigates rank collapse and improves modeling capacity on complex logical and deductive reasoning tasks.
📝 Abstract
Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.
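The abstract describes denoising attention by "integrating signals sampled from the logit distribution." The paper's exact formulation is not given here, so the following is only a hedged Monte Carlo reading of that idea: perturb the scaled dot-product logits with sampled noise, apply softmax to each sample, and average (integrate) the resulting distributions. All names (`integral_attention`, `sigma`, `n_samples`) and the Gaussian sampling choice are illustrative assumptions, not the authors' method.

```python
# Sketch only: a Monte Carlo interpretation of "integral attention",
# NOT the paper's actual formulation. Gaussian logit noise is an assumption.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def integral_attention(q, k, v, n_samples=32, sigma=0.5, seed=None):
    """Average softmax attention over noisy samples of the logits.

    q, k, v: arrays of shape (T, d). Returns the (T, d) outputs and the
    averaged (T, T) attention map. `sigma` sets the sampling spread.
    """
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # standard scaled dot-product logits
    # Sample perturbed logits, softmax each sample, then average:
    # a Monte Carlo estimate of the integral over the logit distribution.
    noise = rng.normal(0.0, sigma, size=(n_samples,) + logits.shape)
    attn = softmax(logits[None] + noise, axis=-1).mean(axis=0)
    return attn @ v, attn
```

Because each sampled softmax is a valid distribution, the averaged map still has rows summing to one; the averaging tends to flatten spuriously sharp peaks (the attention-noise tokens the abstract mentions) without forcing any score negative, consistent with the contrast the paper draws against Cog and Differential attention.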