SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

📅 2026-02-08
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the inherent trade-off between training stability and expressive capacity in conventional Transformer architectures, where the Pre-Norm and Post-Norm variants exhibit incompatible structural dynamics. To reconcile this tension, the authors propose SiameseNorm, a dual-stream Transformer architecture that couples Pre-Norm and Post-Norm pathways through a parameter-sharing mechanism. Each residual block thereby receives gradient signals from both normalization paradigms simultaneously, enabling stability and representational power to be optimized jointly within a single model. By decoupling the optimization dynamics of the two normalization schemes, SiameseNorm achieves strong training robustness in pretraining experiments at the 1.3-billion-parameter scale and consistently outperforms strong baselines.

📝 Abstract
Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, foregoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: Any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams, retaining the distinct characteristics of both Pre-Norm and Post-Norm by enabling all residual blocks to receive combined gradients inherited from both paradigms, where one stream secures stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
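
To make the mechanism in the summary and abstract concrete, here is a minimal PyTorch sketch of what a dual-stream block with shared sublayer parameters could look like. The class name SiameseNormBlock, the per-stream LayerNorms, and the update order are assumptions of this sketch, not the paper's exact formulation; see the linked repository for the reference implementation.

```python
# Minimal sketch of a SiameseNorm-style dual-stream block, written from the
# abstract alone. Names, the per-stream LayerNorms, and the exact update
# order are assumptions of this sketch, not the paper's specification.
import torch
import torch.nn as nn


class SiameseNormBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Shared sublayers: a single set of attention/FFN weights serves
        # both streams, so it receives gradients from both paradigms.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Stream-specific LayerNorms (an assumption of this sketch).
        self.pre_ln1 = nn.LayerNorm(d_model)
        self.pre_ln2 = nn.LayerNorm(d_model)
        self.post_ln1 = nn.LayerNorm(d_model)
        self.post_ln2 = nn.LayerNorm(d_model)

    def forward(self, x_pre: torch.Tensor, x_post: torch.Tensor):
        # Pre-Norm stream: x + F(LN(x)). The residual path is a pure
        # identity, preserving the clean gradient highway (stability).
        h = self.pre_ln1(x_pre)
        x_pre = x_pre + self.attn(h, h, h, need_weights=False)[0]
        x_pre = x_pre + self.ffn(self.pre_ln2(x_pre))

        # Post-Norm stream: LN(x + F(x)). Normalizing after the residual
        # addition is harder to optimize but more expressive.
        a = self.attn(x_post, x_post, x_post, need_weights=False)[0]
        x_post = self.post_ln1(x_post + a)
        x_post = self.post_ln2(x_post + self.ffn(x_post))
        return x_pre, x_post


# Both streams start from the same embeddings; backpropagating through
# either output writes gradients into the same shared weights.
x = torch.randn(2, 16, 512)
block = SiameseNormBlock()
y_pre, y_post = block(x, x)
```

Because the attention and FFN modules are shared, a backward pass through either stream updates the same weights, which is the combined-gradient property the abstract describes. How the two streams' outputs are combined into a training loss is not stated in this summary, so the sketch leaves that choice open.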
Problem

Research questions and friction points this paper is trying to address.

Pre-Norm
Post-Norm
Transformer
optimization stability
architectural incompatibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

SiameseNorm
Pre-Norm
Post-Norm
two-stream architecture
Transformer optimization
Authors
Tianyu Li · Leap Lab, Tsinghua University
Dongchen Han · Tsinghua University · Computer Vision, Deep Learning
Zixuan Cao · Institute for Interdisciplinary Information Sciences, Tsinghua University
Haofeng Huang · Tsinghua University · Generative Models, Efficient Machine Learning, Machine Learning Systems
Mengyu Zhou · Microsoft Research · Data Analytics, Natural Language Processing, Network Science, Human Behaviors, Mobile & Ubiquitous Computing
Ming Chen · Qwen Large Model Application Team, Alibaba
Erchao Zhao · Qwen Large Model Application Team, Alibaba
Xiaoxi Jiang · Qwen Large Model Application Team, Alibaba
Guanjun Jiang · Qwen Large Model Application Team, Alibaba
Gao Huang · Leap Lab, Tsinghua University