🤖 AI Summary
Softmax-based self-attention often assigns excessive weight to semantically irrelevant tokens (e.g., punctuation, special symbols), causing attention noise and rank collapse. To address this, we propose the Integral Transformer, which avoids negative attention scores and instead calibrates attention distributions by sampling from and integrating over the logit distribution, suppressing noise while preserving critical special tokens (e.g., [CLS], [SEP]). The method is applied hierarchically: lower layers retain standard Softmax attention to preserve representational fidelity, while upper layers adopt integral attention for stability and generalization. Evaluated on knowledge-intensive reasoning benchmarks, including CLUE, ReClor, and LogiQA, the Integral Transformer consistently outperforms the standard Transformer, Cog Attention, and the Differential Transformer. It mitigates rank collapse and improves modeling capacity on complex logical and deductive reasoning tasks.
📝 Abstract
Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.
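The abstract describes denoising attention by "integrating signals sampled from the logit distribution." The paper's exact formulation is not given here, so the following is only a hedged Monte Carlo reading of that idea: perturb the scaled dot-product logits with sampled noise, apply softmax to each sample, and average (integrate) the resulting distributions. All names (`integral_attention`, `sigma`, `n_samples`) and the Gaussian sampling choice are illustrative assumptions, not the authors' method.

```python
# Sketch only: a Monte Carlo interpretation of "integral attention",
# NOT the paper's actual formulation. Gaussian logit noise is an assumption.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def integral_attention(q, k, v, n_samples=32, sigma=0.5, seed=None):
    """Average softmax attention over noisy samples of the logits.

    q, k, v: arrays of shape (T, d). Returns the (T, d) outputs and the
    averaged (T, T) attention map. `sigma` sets the sampling spread.
    """
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # standard scaled dot-product logits
    # Sample perturbed logits, softmax each sample, then average:
    # a Monte Carlo estimate of the integral over the logit distribution.
    noise = rng.normal(0.0, sigma, size=(n_samples,) + logits.shape)
    attn = softmax(logits[None] + noise, axis=-1).mean(axis=0)
    return attn @ v, attn
```

Because each sampled softmax is a valid distribution, the averaged map still has rows summing to one; the averaging tends to flatten spuriously sharp peaks (the attention-noise tokens the abstract mentions) without forcing any score negative, consistent with the contrast the paper draws against Cog and Differential attention.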