A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the geometric structure and evolution of predictive information across the layers of large language models. It repurposes representation lenses (learned affine maps that predict the next token from intermediate residual streams) as geometric diagnostic tools, combining them with singular subspace analysis and trajectory tracking on the Grassmann manifold. The study reveals, for the first time, a consistent three-phase structure governing the evolution of predictive subspaces within the residual stream: Seeding Multiplexing, Hoisting Overriding, and Focal Convergence. Experiments across eight models ranging from 1B to 32B parameters show the same triphasic pattern. Additional model depth is largely absorbed by candidate disambiguation: the second phase lengthens linearly with depth, while the first and third grow slowly. The phases differ in their effect on the effective rank of representations, which expands, stabilizes, and then concentrates.
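The summary refers to the effective rank of representations without giving a formula. As an illustration only, the sketch below uses one common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values). Whether the paper uses this definition is an assumption, and the function name `effective_rank` is hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a spectrum (assumed definition).

    Normalizes the singular values into a probability distribution and
    returns exp(entropy). A flat spectrum of n equal values gives n; a
    spectrum with a single nonzero value gives 1.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Zero entries contribute nothing to the entropy, so skip them.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition, an update that spreads energy over more directions raises the value, and one that concentrates energy in a few directions lowers it, which matches the expand, stabilize, and concentrate description only qualitatively.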

📝 Abstract

We investigate the geometry of predictive information across the layers of large language models (LLMs). We repurpose representation lenses, learned affine maps trained to predict the next token from intermediate residual streams, as geometric diagnostic tools. Rather than asking what the model predicts at each layer, we ask where predictive information resides and how it evolves across depth. We define at each layer a predictive readout subspace as the dominant k-dimensional singular subspace of such a map on the d-dimensional residual stream (where k is a resolution parameter), and track its trajectory on the Grassmann manifold as a similarity profile across layers. The profile is well described by unimodal distributions exhibiting a rise, near-plateau, and descent; varying k from 1% to 50% of d traces a Pareto frontier between visibility and energy retention, yet the same structure emerges at all scales. Across eight models from two families (Qwen2.5 and OLMo2, 1B to 32B parameters), we identify three geometric phases. Updates are approximately orthogonal to the residual stream throughout; what distinguishes the phases is their effect on the effective rank, which expands, stabilizes, and concentrates. In the first, Seeding Multiplexing, feed-forward memories and attention layers seed a candidate set in superposition in family-specific proportions, with the final token rising as the leading candidate from 20% to 35% of positions across this phase. In the second, Hoisting Overriding, updates override existing subspaces to concentrate the candidate distribution without expanding the rank. In the third, Focal Convergence, high-energy low-rank updates write the winner into a form aligned with the unembedding direction. Phases 1 and 3 grow slowly with model depth, while Phase 2 expands linearly. The additional capacity of deeper LLMs is largely absorbed by candidate disambiguation.
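The readout-subspace construction in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: each lens is represented only by its linear part `W` of shape `(n_out, d)`, the similarity between subspaces is taken to be the mean squared cosine of their principal angles, and the profile is measured against a single reference layer. The function names and the choice of reference layer are hypothetical.

```python
import numpy as np


def readout_subspace(W, k):
    """Orthonormal basis (d x k) of the top-k right singular subspace of W.

    W is the linear part of a lens, shape (n_out, d), acting on a
    residual-stream vector h as W @ h, so its right singular vectors live
    in the d-dimensional residual stream. Requires k <= min(n_out, d).
    """
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return vt[:k].T


def subspace_similarity(U, V):
    """Mean squared cosine of the principal angles between two subspaces.

    Assumed metric: ||U^T V||_F^2 / k. It equals 1 for identical subspaces
    and 0 for mutually orthogonal ones.
    """
    k = U.shape[1]
    return float(np.sum((U.T @ V) ** 2) / k)


def similarity_profile(lens_maps, k, ref_index=-1):
    """Similarity of each layer's readout subspace to a reference layer's.

    lens_maps is a list of per-layer matrices W. Using the last layer as
    the reference is an assumption made for this sketch.
    """
    bases = [readout_subspace(W, k) for W in lens_maps]
    ref = bases[ref_index]
    return [subspace_similarity(B, ref) for B in bases]
```

Sweeping `k` from about 1% to 50% of `d` and recomputing the profile would mirror the resolution study the abstract describes. The squared-cosine overlap is only one of several distances on the Grassmann manifold and may differ from the one used in the paper.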

Problem

Research questions and friction points this paper is trying to address.

predictive information

geometric phases

large language models

residual stream

Grassmann manifold

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric phase

predictive subspace

Grassmann manifold