🤖 AI Summary
This work investigates how token representations evolve across Transformer blocks in large language models (LLMs) and how that evolution relates to generalization. We linearize each block along the forward trajectory of token embeddings through its Jacobian matrix and measure how closely the top singular vectors of these block Jacobians align across tokens and across depth, a phenomenon we call transformer block coupling. Coupling correlates positively with model performance, and this relationship is stronger than the one with hyperparameters such as parameter count, model depth, and embedding dimension. Over the course of training, coupling develops progressively, together with increased linearity and layer-wise exponential growth in token trajectories. Experiments with Vision Transformers (ViTs) show the same emergence of coupling and the same relationship with generalization. Together, these results offer a new perspective on token interactions in transformers and suggest directions for studying their mechanisms and for improving training and generalization.
📝 Abstract
Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we analyze the trajectories of token embeddings as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. By examining the relationships between these block Jacobians, we uncover the phenomenon of transformer block coupling in a multitude of LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling positively correlates with model performance, and that this relationship is stronger than with other hyperparameters such as parameter count, model depth, and embedding dimension. We further investigate how these properties emerge during training, observing a progressive development of coupling, increased linearity, and layer-wise exponential growth in token trajectories. Additionally, experiments with Vision Transformers (ViTs) corroborate the emergence of coupling and its relationship with generalization, reinforcing our findings in LLMs. Collectively, these insights offer a novel perspective on token interactions in transformers, opening new directions for studying their mechanisms as well as improving training and generalization.
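
To make the idea of comparing block Jacobians through their top singular vectors concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the toy residual block `x + W2 @ tanh(W1 @ x)` stands in for a real transformer block, the names `block`, `block_jacobian`, and `misalignment` are hypothetical, and the misalignment score is one plausible way to quantify coupling rather than the metric defined in the paper.

```python
import numpy as np

def block(x, W1, W2):
    """Toy residual block (assumed stand-in for a transformer block)."""
    return x + W2 @ np.tanh(W1 @ x)

def block_jacobian(x, W1, W2):
    """Analytic Jacobian of the toy block evaluated at the point x."""
    h = np.tanh(W1 @ x)
    return np.eye(x.size) + W2 @ np.diag(1.0 - h**2) @ W1

def misalignment(J_a, J_b, k):
    """Illustrative score (an assumption, not the paper's definition).

    Projects J_a onto the top-k singular vectors of J_b and measures how far
    the result is from the diagonal matrix of J_a's top-k singular values.
    A score of 0 means the top-k singular vectors of J_b diagonalize J_a.
    """
    U_b, _, Vt_b = np.linalg.svd(J_b)
    S_a = np.linalg.svd(J_a, compute_uv=False)[:k]
    projected = U_b[:, :k].T @ J_a @ Vt_b[:k].T
    return np.linalg.norm(projected - np.diag(S_a)) / np.linalg.norm(S_a)

# Follow one embedding through a stack of random blocks, linearizing each
# block at the point where the trajectory enters it.
rng = np.random.default_rng(0)
d, depth, k = 8, 4, 3
x = rng.normal(size=d)
jacobians = []
for _ in range(depth):
    W1 = rng.normal(size=(d, d)) / np.sqrt(d)
    W2 = rng.normal(size=(d, d)) / np.sqrt(d)
    jacobians.append(block_jacobian(x, W1, W2))
    x = block(x, W1, W2)

# Pairwise scores across depth: entry (a, b) compares block a against block b.
scores = np.array([[misalignment(Ja, Jb, k) for Jb in jacobians]
                   for Ja in jacobians])
print(np.round(scores, 3))
```

The diagonal of `scores` is zero up to floating-point error, since each Jacobian is diagonalized by its own singular vectors. With untrained random weights the off-diagonal entries carry no particular structure; in a real model the Jacobians would come from automatic differentiation of the actual blocks, and the same comparison could also be made across tokens rather than only across depth.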