🤖 AI Summary
To address attention localization and deep-layer entropy collapse in the self-attention of small-scale models, this paper proposes SAOBP (Self-Attention One-step Belief Propagation), a framework that integrates a one-step belief propagation process into standard Transformers. By modeling attention as a dynamic graph, SAOBP explicitly strengthens multi-hop dependency modeling. The paper also introduces the Global Token Dependency (GTD) metric to quantify the contribution of long-range relations, mitigating attention entropy collapse without increasing model depth. Experiments show that SAOBP markedly improves long-range dependency capture in resource-constrained small models, yielding consistent gains across downstream tasks, including language modeling, question answering, and text classification, while improving overall inference quality.
📝 Abstract
The Transformer self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD), a metric that captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.
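As a rough illustration of the underlying idea (not the paper's exact update rule), one propagation step over the attention graph can be sketched as mixing the row-stochastic attention matrix with its two-hop composition `A @ A`, which injects paths of length two; a row-wise entropy measure then shows how this counteracts collapse onto a few tokens. The function names and the mixing coefficient `alpha` are assumptions made for this sketch:

```python
import numpy as np

def one_hop_refine(A, alpha=0.5):
    """Illustrative one-step propagation: mix direct attention A with its
    two-hop composition A @ A, then renormalize rows. A hypothetical
    stand-in for SAOBP's refinement, not the paper's exact update."""
    refined = (1 - alpha) * A + alpha * (A @ A)
    return refined / refined.sum(axis=-1, keepdims=True)

def attention_entropy(A):
    """Mean row-wise Shannon entropy of a row-stochastic attention matrix.
    Low values indicate attention collapsed onto few tokens."""
    eps = 1e-12  # avoid log(0) on exact-zero entries
    return float(-(A * np.log(A + eps)).sum(axis=-1).mean())

# Toy example: sharply localized attention (each token attends 0.9 to
# itself and 0.1 to its right neighbor). One propagation step spreads
# mass along two-hop paths and raises the attention entropy.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.9
    A[i, (i + 1) % n] = 0.1

R = one_hop_refine(A, alpha=0.5)
print(attention_entropy(A), attention_entropy(R))
```

On this toy matrix the refined attention remains row-stochastic while its entropy increases, mirroring the qualitative claim that a single propagation step mitigates entropy collapse without adding depth.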