🤖 AI Summary
To address attention localization and deep-layer entropy collapse in the self-attention of small-scale models, this paper proposes SAOBP (Self-Attention One-step Belief Propagation), a framework that integrates a one-step belief propagation process into standard Transformers. By modeling attention as a dynamic graph, SAOBP explicitly strengthens multi-hop dependency modeling. The paper also introduces the Global Token Dependency (GTD) metric to quantify the contribution of long-range relations, mitigating attention entropy collapse without increasing model depth. Experiments show that SAOBP markedly improves long-range dependency capture in resource-constrained small models, yielding consistent gains across downstream tasks, including language modeling, question answering, and text classification, while improving overall inference quality.
📝 Abstract
The Transformer self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD), a metric that captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.
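As a rough illustration of the underlying idea (not the paper's exact update rule), one propagation step over the attention graph can be sketched as mixing the row-stochastic attention matrix with its two-hop composition `A @ A`, which injects paths of length two; a row-wise entropy measure then shows how this counteracts collapse onto a few tokens. The function names and the mixing coefficient `alpha` are assumptions made for this sketch:

```python
import numpy as np

def one_hop_refine(A, alpha=0.5):
    """Illustrative one-step propagation: mix direct attention A with its
    two-hop composition A @ A, then renormalize rows. A hypothetical
    stand-in for SAOBP's refinement, not the paper's exact update."""
    refined = (1 - alpha) * A + alpha * (A @ A)
    return refined / refined.sum(axis=-1, keepdims=True)

def attention_entropy(A):
    """Mean row-wise Shannon entropy of a row-stochastic attention matrix.
    Low values indicate attention collapsed onto few tokens."""
    eps = 1e-12  # avoid log(0) on exact-zero entries
    return float(-(A * np.log(A + eps)).sum(axis=-1).mean())

# Toy example: sharply localized attention (each token attends 0.9 to
# itself and 0.1 to its right neighbor). One propagation step spreads
# mass along two-hop paths and raises the attention entropy.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.9
    A[i, (i + 1) % n] = 0.1

R = one_hop_refine(A, alpha=0.5)
print(attention_entropy(A), attention_entropy(R))
```

On this toy matrix the refined attention remains row-stochastic while its entropy increases, mirroring the qualitative claim that a single propagation step mitigates entropy collapse without adding depth.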