🤖 AI Summary
The placement of normalization layers in Transformers—before (Pre-Norm) or after (Post-Norm) residual connections—remains a subject of debate due to its impact on model performance and training stability. The authors revisit this question from the perspective of manifold optimization, interpreting the outputs of feed-forward networks and attention layers as update directions in an optimization trajectory. They propose GeoNorm, a novel normalization method based on geodesic updates, which unifies the Pre-Norm and Post-Norm frameworks. Additionally, they introduce an inter-layer update decay mechanism analogous to learning rate scheduling. Experimental results demonstrate that GeoNorm consistently outperforms existing normalization strategies across various Transformer architectures while incurring negligible computational overhead.
📝 Abstract
The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.
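The abstract's core idea—treating each sub-layer output as an update direction and replacing residual-plus-normalization with a geodesic step—can be illustrated with a minimal sketch. The sketch below assumes the manifold is the unit hypersphere (the surface that RMS-normalized hidden states effectively live on); the paper's exact parameterization, step sizes, and decay schedule are not specified here, so `geodesic_update` and the `etas` schedule are hypothetical stand-ins:

```python
import numpy as np

def geodesic_update(x, u, eta):
    """One hypothetical GeoNorm-style step: move the hidden state x along
    a geodesic of the unit hypersphere in the direction of the sub-layer
    output u. Sketch only, not the paper's exact formulation."""
    x = x / np.linalg.norm(x)           # keep x on the unit sphere
    v = u - np.dot(u, x) * x            # project u onto the tangent space at x
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:                  # no tangential component: no movement
        return x
    step = eta * norm_v                 # geodesic distance scaled by step size
    # exponential map on the sphere: exp_x(eta * v)
    return np.cos(step) * x + np.sin(step) * (v / norm_v)

# Layer-wise update decay, analogous to a learning-rate schedule
# (a simple 1/(l+1) decay, chosen for illustration only).
num_layers = 12
etas = [1.0 / (l + 1) for l in range(num_layers)]

rng = np.random.default_rng(0)
x = rng.normal(size=64)
x = x / np.linalg.norm(x)
for eta in etas:
    u = rng.normal(size=64)             # stand-in for an FFN/attention output
    x = geodesic_update(x, u, eta)

print(round(float(np.linalg.norm(x)), 6))  # prints 1.0: state stays on the sphere
```

The appeal of this view is that the hidden state never leaves the manifold, so no separate post-hoc normalization is needed: the normalization constraint is built into the update itself, and the decaying `eta` plays the role the abstract attributes to the layer-wise update decay.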