🤖 AI Summary
This work identifies and systematically characterizes a prevalent phenomenon in pre-trained Transformer language models: an abrupt angular shift in the hidden states at or around the final layer, accompanied by underutilization of the intermediate layers. To address this, the authors propose the jump-suppressing regularizer (JREG), a lightweight, architecture-agnostic technique that constrains the magnitude of this angular jump during pre-training. Experiments across three Llama model scales show consistent performance improvements over baselines when JREG is applied, supporting its efficacy and generality in promoting more balanced layer-wise representation learning.
📝 Abstract
This paper examines the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden-state vectors of the middle Transformer layers, while a disproportionately large "jump" in angular distance occurs in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, then demonstrate its prevalence across many open-weight models and its amplification over the course of pre-training. On the assumption that such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG), which penalizes this jump during pre-training and thereby encourages more balanced use of the middle layers. Empirical evaluations of Llama-based models at three sizes show that training with JREG improves task performance over the baseline without altering the model architecture.
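The abstract does not give the exact definition of the jump-strength metric, but the idea can be illustrated with a minimal NumPy sketch. Here we assume (hypothetically; the paper's actual formula may differ) that the metric is the angular distance between consecutive layers' hidden states, with jump strength defined as the final-layer angular change relative to the average change over the preceding layers; JREG would then penalize this quantity during pre-training.

```python
import numpy as np

def angular_distance(u, v):
    """Angle in radians between two hidden-state vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding

def jump_strength(hidden_states):
    """Hypothetical jump-strength score: the angular change introduced
    by the final layer, divided by the mean angular change of the
    earlier layers. A value near 1 means balanced layer usage; a large
    value means the final layer does a disproportionate "jump".

    hidden_states: list of per-layer vectors (embedding output followed
    by each Transformer layer's output) for a single token position.
    """
    deltas = [angular_distance(a, b)
              for a, b in zip(hidden_states[:-1], hidden_states[1:])]
    mean_middle = np.mean(deltas[:-1])
    return deltas[-1] / (mean_middle + 1e-8)
```

For example, a trajectory that rotates by 0.01 rad per middle layer but ~0.97 rad at the last layer yields a jump strength near 97, whereas a trajectory with uniform per-layer rotation yields a score of about 1.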