SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

📅 2026-02-08
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the inherent trade-off between training stability and expressive capacity in conventional Transformer architectures, where the Pre-Norm and Post-Norm variants exhibit incompatible structural dynamics. To reconcile this tension, the authors propose SiameseNorm, a dual-stream Transformer architecture that couples Pre-Norm and Post-Norm pathways through a parameter-sharing mechanism. Each residual block thereby receives gradient signals from both normalization paradigms simultaneously, enabling stability and representational power to be optimized jointly within a single model. By decoupling the optimization dynamics of the two normalization schemes, SiameseNorm achieves strong training robustness in pretraining experiments at the 1.3-billion-parameter scale and consistently outperforms strong baselines.

📝 Abstract
Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
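
To make the mechanism in the summary and abstract concrete, here is a minimal PyTorch sketch of what a dual-stream block with shared sublayer parameters could look like. The class name SiameseNormBlock, the per-stream LayerNorms, and the update order are assumptions of this sketch, not the paper's exact formulation; see the linked repository for the reference implementation.

```python
# Minimal sketch of a SiameseNorm-style dual-stream block, written from the
# abstract alone. Names, the per-stream LayerNorms, and the exact update
# order are assumptions of this sketch, not the paper's specification.
import torch
import torch.nn as nn


class SiameseNormBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Shared sublayers: a single set of attention/FFN weights serves
        # both streams, so it receives gradients from both paradigms.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Stream-specific LayerNorms (an assumption of this sketch).
        self.pre_ln1 = nn.LayerNorm(d_model)
        self.pre_ln2 = nn.LayerNorm(d_model)
        self.post_ln1 = nn.LayerNorm(d_model)
        self.post_ln2 = nn.LayerNorm(d_model)

    def forward(self, x_pre: torch.Tensor, x_post: torch.Tensor):
        # Pre-Norm stream: x + F(LN(x)). The residual path is a pure
        # identity, preserving the clean gradient highway (stability).
        h = self.pre_ln1(x_pre)
        x_pre = x_pre + self.attn(h, h, h, need_weights=False)[0]
        x_pre = x_pre + self.ffn(self.pre_ln2(x_pre))

        # Post-Norm stream: LN(x + F(x)). Normalizing after the residual
        # addition is harder to optimize but more expressive.
        a = self.attn(x_post, x_post, x_post, need_weights=False)[0]
        x_post = self.post_ln1(x_post + a)
        x_post = self.post_ln2(x_post + self.ffn(x_post))
        return x_pre, x_post


# Both streams start from the same embeddings; backpropagating through
# either output writes gradients into the same shared weights.
x = torch.randn(2, 16, 512)
block = SiameseNormBlock()
y_pre, y_post = block(x, x)
```

Because the attention and FFN modules are shared, a backward pass through either stream updates the same weights, which is the combined-gradient property the abstract describes. How the two streams' outputs are combined into a training loss is not stated in this summary, so the sketch leaves that choice open.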
Problem

Research questions and friction points this paper is trying to address.

Pre-Norm
Post-Norm
Transformer
optimization stability
architectural incompatibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

SiameseNorm
Pre-Norm
Post-Norm
two-stream architecture
Transformer optimization
Authors
Tianyu Li · Leap Lab, Tsinghua University
Dongchen Han · Tsinghua University · Computer Vision, Deep Learning
Zixuan Cao · Institute for Interdisciplinary Information Sciences, Tsinghua University
Haofeng Huang · Tsinghua University · Generative Models, Efficient Machine Learning, Machine Learning Systems
Mengyu Zhou · Microsoft Research · Data Analytics, Natural Language Processing, Network Science, Human Behaviors, Mobile & Ubiquitous Computing
Ming Chen · Qwen Large Model Application Team, Alibaba
Erchao Zhao · Qwen Large Model Application Team, Alibaba
Xiaoxi Jiang · Qwen Large Model Application Team, Alibaba
Guanjun Jiang · Qwen Large Model Application Team, Alibaba
Gao Huang · Leap Lab, Tsinghua University