Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address attention localization and deep-layer entropy collapse in the self-attention of small-scale models, this paper proposes SAOBP, a framework that integrates one-step belief propagation into standard Transformers. By modeling attention as a dynamic graph, SAOBP explicitly strengthens multi-hop dependency modeling. The authors also introduce the Global Token Dependency (GTD) metric to quantify the contribution of long-range relations, mitigating attention entropy collapse without increasing model depth. Experiments indicate that SAOBP improves long-range dependency capture in resource-constrained small models, yielding consistent gains across downstream tasks including language modeling, question answering, and text classification, and improving overall inference quality.
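The summary does not give SAOBP's exact update rule, but the idea of refining attention with one step of belief propagation can be illustrated roughly as follows. This is a minimal sketch assuming the one-step update mixes the direct attention matrix with its two-hop composition; the function name saobp_refine and the coefficient alpha are illustrative, not taken from the paper.

```python
import numpy as np

def saobp_refine(attn: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Illustrative one-step propagation over an attention matrix.

    attn:  (seq_len, seq_len) row-stochastic attention weights for one head.
    alpha: hypothetical mixing weight for the propagated (two-hop) term.
    """
    two_hop = attn @ attn                                  # beliefs passed one extra hop
    refined = (1.0 - alpha) * attn + alpha * two_hop       # mix direct and multi-hop paths
    return refined / refined.sum(axis=-1, keepdims=True)   # keep rows normalized

# Toy usage: a strongly localized (near-diagonal) pattern is smoothed outward.
attn = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.1, 0.9]])
print(saobp_refine(attn))
```

Because the two-hop term routes mass through intermediate tokens, the refined matrix assigns weight to token pairs that direct attention never connected, which is the multi-hop dependency effect the summary describes.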

📝 Abstract
The Transformer-based self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD), which captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.
Problem

Research questions and friction points this paper is trying to address.

Addresses attention localization in small-scale transformers
Improves long-range dependency capture via belief propagation
Prevents entropy collapse in deeper model layers (see the diagnostic sketch below)
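Entropy collapse is usually diagnosed by tracking how peaked the attention distributions become across layers. The helper below is a minimal diagnostic sketch, not code from the paper; the name attention_entropy is illustrative.

```python
import numpy as np

def attention_entropy(attn: np.ndarray, eps: float = 1e-12) -> float:
    """Mean Shannon entropy (in nats) over the rows of one head's attention.

    attn: (seq_len, seq_len) row-stochastic attention weights.
    Low values in deep layers signal localization / entropy collapse.
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean())
```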
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Attention One-step Belief Propagation refinement
Global Token Dependency (GTD) metric quantifying multi-hop contributions (see the sketch after this list)
Maintains attention entropy in deep layers without increasing model depth
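The paper's exact GTD formula is not reproduced on this page; the sketch below shows one plausible reading of "relative contribution of multi-hop connections": the share of refined attention mass that lands on token pairs barely reachable through direct attention. The mixing weight alpha and threshold eps are illustrative assumptions, reusing the same one-step rule as the sketch above.

```python
import numpy as np

def global_token_dependency(attn: np.ndarray, alpha: float = 0.3, eps: float = 1e-3) -> float:
    """Hypothetical GTD-style statistic for one attention head.

    Refines attention with the illustrative one-step rule, then reports the
    fraction of refined mass on pairs with near-zero direct attention,
    i.e., mass that arrived via multi-hop paths.
    """
    refined = (1.0 - alpha) * attn + alpha * (attn @ attn)   # one-step propagation
    refined /= refined.sum(axis=-1, keepdims=True)
    weak_direct = attn < eps                                  # links absent in one hop
    return float(refined[weak_direct].sum() / refined.sum())
```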
Nakyung Lee
Seoul National University
Yeongoon Kim
Seoul National University
Minhae Oh
Seoul National University
Suhwan Kim
Seoul National University
Jin Woo Koo
Seoul National University
Hyewon Jo
Seoul National University
Jungwoo Lee
Professor, Department of Electrical and Computer Engineering, Seoul National University
Machine Learning, Distributed Computing, Information Theory