🤖 AI Summary
This work addresses the lack of theoretical understanding of why gated attention mechanisms improve sample efficiency and model performance. We propose the first formal analysis by modeling gated attention as a hierarchical mixture-of-experts system, reframing the learning process as an expert estimation problem. Leveraging statistical learning theory, we analyze its sample complexity and show that, unlike conventional multi-head self-attention, which requires an exponential number of samples to reach a given estimation accuracy, gated attention attains the same accuracy with only a polynomial number of samples. Furthermore, our framework provides a theoretical foundation for the placement of gating modules, revealing their critical influence on overall model performance.
📝 Abstract
Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling learning from data with long-range dependencies. In an effort to improve performance, a gated attention model that leverages a gating mechanism within the multi-head self-attention has recently been proposed as a promising alternative. Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention and even to eliminate the attention sink phenomenon. Despite its efficacy, a clear theoretical understanding of gated attention's benefits remains lacking in the literature. To close this gap, we rigorously show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated attention is more sample-efficient than multi-head self-attention. In particular, while the former needs only a polynomial number of data points to estimate an expert, the latter requires exponentially many data points to achieve the same estimation error. Furthermore, our analysis also provides a theoretical justification for why gated attention yields higher performance when a gate is placed at the output of the scaled dot product attention or the value map rather than at other positions in the multi-head self-attention architecture.
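To make the gate placement discussed above concrete, here is a minimal NumPy sketch of a single attention head with an elementwise sigmoid gate applied to the output of the scaled dot-product attention, the placement the analysis identifies as beneficial. The weight names, shapes, and the use of a sigmoid gate are illustrative assumptions for this sketch, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_head(X, Wq, Wk, Wv, Wg):
    """Single attention head with a sigmoid gate on the SDPA output.

    X: (seq_len, d_model); Wq, Wk, Wv, Wg: (d_model, d_head).
    The input-dependent gate G = sigmoid(X @ Wg) modulates the
    attention output elementwise (hypothetical parameterization).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_head))  # (seq_len, seq_len) attention weights
    sdpa_out = A @ V                        # standard scaled dot-product attention
    G = sigmoid(X @ Wg)                     # gate values in (0, 1)
    return G * sdpa_out                     # gate applied after the SDPA output

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv, Wg = (rng.standard_normal((8, 4)) for _ in range(4))
out = gated_attention_head(X, Wq, Wk, Wv, Wg)
print(out.shape)  # (5, 4)
```

Placing the gate before the query/key maps instead would rescale the attention logits rather than the head's output, which is the kind of alternative placement the abstract contrasts against.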