Constrained belief updates explain geometric structures in transformer representations

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the computational structures that emerge in Transformers trained on next-token prediction and how those structures explain the geometry of their internal representations. Method: The authors propose a theoretical framework of architecture-constrained, parallelized Bayesian belief updating, unifying the model-agnostic theory of optimal prediction with mechanistic interpretability. Using a tractable family of hidden Markov models (HMMs), probability-simplex analysis, and constraint-based modification of the optimal-prediction equations, they quantitatively predict attention patterns, OV-circuit vector orientations, and embedding geometry. Contribution/Results: The framework derives the geometric structure of attention patterns, OV-circuit vectors, and token embeddings in detail, and establishes their formal correspondence to Bayesian inference. On controlled HMM tasks, it reproduces and explains the observed geometric representations — including cyclic dynamics and low-dimensional manifolds — demonstrating both the quantitative accuracy and the mechanistic interpretability of the theoretical predictions.

📝 Abstract
What computational structures emerge in transformers trained on next-token prediction? In this work, we provide evidence that transformers implement constrained Bayesian belief updating -- a parallelized version of partial Bayesian inference shaped by architectural constraints. To do this, we integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models that generate rich geometric patterns in neural activations. We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail -- including the attention pattern, OV-vectors, and embedding vectors -- by modifying the equations for optimal future token predictions to account for the architectural constraints of attention. Our approach provides a principled lens on how gradient descent resolves the tension between optimal prediction and architectural design.
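The "optimal future token predictions" the abstract refers to come from standard Bayesian filtering on an HMM: maintain a belief distribution over hidden states (a point in the probability simplex), and update it after each observed token. A minimal sketch of that update, using a small hypothetical 2-state HMM (the matrices below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM (illustrative values, not from the paper).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # T[i, j] = P(next state j | current state i)
E = np.array([[0.7, 0.3],
              [0.1, 0.9]])   # E[j, o] = P(emit symbol o | state j)

def belief_update(belief, obs):
    """One step of Bayesian filtering: propagate through T, reweight by likelihood."""
    unnorm = (belief @ T) * E[:, obs]
    return unnorm / unnorm.sum()

def next_token_dist(belief):
    """Optimal next-token prediction implied by the current belief state."""
    return (belief @ T) @ E

belief = np.array([0.5, 0.5])       # uniform prior over hidden states
for obs in [0, 0, 1, 1, 1]:
    belief = belief_update(belief, obs)

print(belief)                # a point in the probability simplex
print(next_token_dist(belief))
```

The paper's claim, as summarized above, is that attention implements a parallelized, architecturally constrained version of this sequential update, which is why belief-simplex geometry shows up in the residual stream.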
Problem

Research questions and friction points this paper is trying to address.

What computational structures emerge in transformers trained on next-token prediction?
How do attention heads produce geometrically structured representations?
How does gradient descent resolve the tension between optimal prediction and architectural design?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained Bayesian belief updating as a model of attention
Integration of optimal prediction theory with mechanistic interpretability
Modification of optimal-prediction equations to account for architectural constraints