🤖 AI Summary
The placement of normalization layers in Transformers—before (Pre-Norm) or after (Post-Norm) residual connections—remains a subject of debate due to its impact on model performance and training stability. The authors revisit this question from the perspective of manifold optimization, interpreting the outputs of feed-forward networks and attention layers as update directions in an optimization trajectory. They propose GeoNorm, a novel normalization method based on geodesic updates, which unifies the Pre-Norm and Post-Norm frameworks. Additionally, they introduce an inter-layer update decay mechanism analogous to learning rate scheduling. Experimental results demonstrate that GeoNorm consistently outperforms existing normalization strategies across various Transformer architectures while incurring negligible computational overhead.
📝 Abstract
The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.
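The abstract's core idea—treating each sub-layer output as an update direction and replacing residual-plus-normalization with a geodesic step—can be illustrated with a minimal sketch. The sketch below assumes the manifold is the unit hypersphere (the surface that RMS-normalized hidden states effectively live on); the paper's exact parameterization, step sizes, and decay schedule are not specified here, so `geodesic_update` and the `etas` schedule are hypothetical stand-ins:

```python
import numpy as np

def geodesic_update(x, u, eta):
    """One hypothetical GeoNorm-style step: move the hidden state x along
    a geodesic of the unit hypersphere in the direction of the sub-layer
    output u. Sketch only, not the paper's exact formulation."""
    x = x / np.linalg.norm(x)           # keep x on the unit sphere
    v = u - np.dot(u, x) * x            # project u onto the tangent space at x
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:                  # no tangential component: no movement
        return x
    step = eta * norm_v                 # geodesic distance scaled by step size
    # exponential map on the sphere: exp_x(eta * v)
    return np.cos(step) * x + np.sin(step) * (v / norm_v)

# Layer-wise update decay, analogous to a learning-rate schedule
# (a simple 1/(l+1) decay, chosen for illustration only).
num_layers = 12
etas = [1.0 / (l + 1) for l in range(num_layers)]

rng = np.random.default_rng(0)
x = rng.normal(size=64)
x = x / np.linalg.norm(x)
for eta in etas:
    u = rng.normal(size=64)             # stand-in for an FFN/attention output
    x = geodesic_update(x, u, eta)

print(round(float(np.linalg.norm(x)), 6))  # prints 1.0: state stays on the sphere
```

The appeal of this view is that the hidden state never leaves the manifold, so no separate post-hoc normalization is needed: the normalization constraint is built into the update itself, and the decaying `eta` plays the role the abstract attributes to the layer-wise update decay.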