SAS: Simulated Attention Score

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the parameter-constrained trade-off between the number of attention heads and per-head dimensionality in Transformers, this paper proposes Simulated Attention Score (SAS): a method that explicitly simulates additional attention heads and enlarged key/query feature dimensions via learnable projections of low-dimensional latent representations—without increasing model parameters. SAS is integrated with Parameter-Efficient Attention Aggregation (PEAA) to enable high-capacity attention computation. Crucially, SAS is the first approach to leverage representation-space expansion for jointly simulating both head count and intra-head dimensionality, thereby substantially enhancing representational capacity while preserving model compactness. Extensive experiments demonstrate that SAS consistently outperforms mainstream attention variants—including multi-head, multi-query, and grouped-query attention—across diverse NLP tasks and benchmarks. Notably, these gains come at minimal parameter overhead, confirming SAS's effectiveness in boosting performance under strict parameter budgets.

📝 Abstract
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
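The core idea in the abstract — projecting low-dimensional head representations into a higher-dimensional space so that a compact model behaves like one with more heads and larger per-head dimensions — can be illustrated with a minimal numpy sketch. This is a simplified illustration of the general mechanism, not the paper's exact method: the projection shapes, the grouped averaging used to fold simulated heads back into real heads, and all variable names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def simulated_attention(q, k, v, head_proj, dim_proj):
    """Sketch of SAS-style head/dimension simulation (shapes are assumptions).

    q, k:      (seq, h, d)  -- h real heads, per-head dim d
    v:         (seq, h, dv)
    head_proj: (h, H)       -- expands h real heads to H simulated heads
    dim_proj:  (d, D)       -- expands per-head feature dim d to D
    """
    # Expand along the head axis: (seq, h, d) -> (seq, H, d)
    q_sim = np.einsum('shd,hH->sHd', q, head_proj)
    k_sim = np.einsum('shd,hH->sHd', k, head_proj)
    # Expand along the feature axis: (seq, H, d) -> (seq, H, D)
    q_sim = q_sim @ dim_proj
    k_sim = k_sim @ dim_proj
    D = q_sim.shape[-1]
    # Scores per simulated head: (H, seq, seq), scaled by sqrt(D)
    scores = np.einsum('sHD,tHD->Hst', q_sim, k_sim) / np.sqrt(D)
    attn = softmax(scores, axis=-1)
    # Fold H simulated score maps back into h real heads; simple group
    # averaging here stands in for the paper's PEAA aggregation
    h, H = head_proj.shape
    attn_real = attn.reshape(h, H // h, *attn.shape[1:]).mean(axis=1)
    # Apply aggregated attention to the original values: (seq, h, dv)
    return np.einsum('hst,thd->shd', attn_real, v)
```

The point of the sketch: only `head_proj` (h·H values) and `dim_proj` (d·D values) are added, which is small next to the full Q/K/V weight matrices, yet the score computation runs with H heads of dimension D.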
Problem

Research questions and friction points this paper is trying to address.

Enhancing attention mechanism performance without increasing parameters
Simulating larger attention heads and hidden dimensions compactly
Improving model expressiveness while maintaining original size
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulates more heads with compact model
Projects low-dim heads to high-dim space
Efficient attention aggregation reduces parameters
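The last bullet — aggregating simulated heads back with few parameters — could take many forms; the paper's PEAA details are not given here. One hypothetical way to keep the cost low is to share a single small weight vector across all head groups instead of learning a full H×h mixing matrix. Everything in this sketch (shapes, the grouping scheme, the name `peaa_aggregate`) is an assumption for illustration.

```python
import numpy as np

def peaa_aggregate(scores, weights, num_heads):
    """Hypothetical parameter-efficient aggregation sketch.

    scores:  (H, seq, seq) -- attention scores from H simulated heads
    weights: (H // num_heads,) -- one weight vector shared by every group
    Returns (num_heads, seq, seq) aggregated scores.
    """
    H, s, t = scores.shape
    g = H // num_heads
    grouped = scores.reshape(num_heads, g, s, t)
    # Only g learnable values instead of H * num_heads for a dense mixer
    return np.einsum('hgst,g->hst', grouped, weights)
```

With H = 8 simulated heads folded into 2 real heads, this uses 4 shared weights rather than 16, which is the kind of saving the "reduces parameters" bullet points at.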