Theory, Analysis, and Best Practices for Sigmoid Self-Attention

📅 2024-09-06
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Softmax-based self-attention carries computational overhead and hardware inefficiency, motivating sigmoid as an alternative activation; however, sigmoid attention has historically faced training instability and lacked theoretical grounding. Method: The authors establish the universal approximation capability and improved regularity of sigmoid attention; identify stabilization of large initial attention norms early in training as the key to reliable optimization; and design FLASHSIGMOID, a hardware-aware, memory-efficient implementation. Contribution/Results: Properly normalized sigmoid attention matches softmax baselines across language, vision, and speech tasks, while FLASHSIGMOID delivers a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. This work unifies prior attempts and establishes best practices for sigmoid attention as a drop-in, theoretically grounded replacement for softmax in transformers.

📝 Abstract
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.
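The abstract describes sigmoid attention as replacing the row-wise softmax over query-key dot products with an elementwise sigmoid, stabilized by normalizing large initial attention norms. Below is a minimal NumPy sketch of this idea, not the paper's FLASHSIGMOID implementation; the sequence-length-dependent bias `b = -log(n)` used here is one illustrative choice for keeping initial attention norms small, and the function name and shapes are assumptions for the example.

```python
import numpy as np

def sigmoid_attention(q, k, v, b=None):
    """Illustrative sigmoid attention (not the paper's kernel).

    Each attention weight is sigmoid(q.k / sqrt(d) + b), applied
    elementwise with no row normalization, unlike softmax. The bias b
    shifts scores negative at initialization so early attention norms
    stay small; b = -log(n) is one such sequence-length-aware choice.
    """
    n, d = q.shape
    if b is None:
        b = -np.log(n)  # assumed stabilizing bias for this sketch
    scores = q @ k.T / np.sqrt(d) + b          # (n, n) logits
    weights = 1.0 / (1.0 + np.exp(-scores))    # elementwise sigmoid
    return weights @ v                          # weighted sum of values

# Usage: random queries/keys/values for a toy sequence.
rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sigmoid_attention(q, k, v)
print(out.shape)  # (8, 16)
```

Because each weight is an independent sigmoid rather than a normalized distribution, rows no longer sum to one; the paper's analysis and normalization address the training consequences of that change.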
Problem

Research questions and friction points this paper is trying to address.

Transformer Design
Sigmoid Attention Mechanism
Performance Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

sigmoid self-attention
FLASHSIGMOID algorithm
performance enhancement