Distinct mechanisms underlying in-context learning in transformers

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work investigates how transformers adapt their computation to the statistical properties of the input during in-context learning. The authors train transformers on a finite set $S$ of discrete Markov chains and combine subcircuit analysis, minimal model construction, symmetry-constrained modeling of training dynamics, and loss landscape analysis to give a complete mechanistic characterization of four algorithmic phases. The study identifies two qualitatively distinct mechanisms for context-adaptive computation and locates the two boundaries, $K_1^*$ and $K_2^*$, that separate memorization from generalization. It also explains the sharp transition from 1-point to 2-point generalization and shows that data diversity, $K = |S|$, influences which mechanism the network uses.


📝 Abstract

Modern distributed networks, notably transformers, acquire a remarkable ability (termed 'in-context learning') to adapt their computation to input statistics, such that a fixed network can be applied to data from a broad range of systems. Here, we provide a complete mechanistic characterization of this behavior in transformers trained on a finite set $S$ of discrete Markov chains. The transformer displays four algorithmic phases, characterized by whether the network memorizes or generalizes, and whether it uses 1-point or 2-point statistics. We show that the four phases are implemented by multi-layer subcircuits that exemplify two qualitatively distinct mechanisms for implementing context-adaptive computations. Minimal models isolate the key features of both motifs. Memorization and generalization phases are delineated by two boundaries that depend on data diversity, $K = |S|$. The first ($K_1^\ast$) is set by a kinetic competition between subcircuits and the second ($K_2^\ast$) is set by a representational bottleneck. A symmetry-constrained theory of a transformer's training dynamics explains the sharp transition from 1-point to 2-point generalization and identifies key features of the loss landscape that allow the network to generalize. Put together, we show that transformers develop distinct subcircuits to implement in-context learning and identify conditions that favor certain mechanisms over others.
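
To make the distinction between 1-point and 2-point statistics, and between memorizing and generalizing, more concrete, here is a minimal sketch of three idealized next-token predictors for a Markov chain context. This is an illustration under stated assumptions, not the paper's method or the transformer's actual circuits: the function names, the additive smoothing parameter `alpha`, the uniform prior over training chains, and the toy transition matrices are all hypothetical, and the paper's precise definitions of the phases may differ. A 1-point memorizing predictor is omitted for brevity.

```python
import math
from collections import Counter


def unigram_predict(context, n_states, alpha=1.0):
    """Generalizing, 1-point: use only how often each state appears in the context."""
    counts = Counter(context)
    total = len(context) + alpha * n_states
    return [(counts[s] + alpha) / total for s in range(n_states)]


def bigram_predict(context, n_states, alpha=1.0):
    """Generalizing, 2-point: use transition counts out of the last observed state."""
    last = context[-1]
    counts = Counter(b for a, b in zip(context, context[1:]) if a == last)
    total = sum(counts.values()) + alpha * n_states
    return [(counts[s] + alpha) / total for s in range(n_states)]


def memorized_bigram_predict(context, chains):
    """Memorizing, 2-point: Bayesian average over a fixed set of known chains.

    Assumes a uniform prior over `chains` and strictly positive transition
    probabilities (so that the logarithm is defined).
    """
    log_lik = [
        sum(math.log(T[a][b]) for a, b in zip(context, context[1:]))
        for T in chains
    ]
    m = max(log_lik)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in log_lik]
    z = sum(weights)
    last = context[-1]
    n_states = len(chains[0])
    return [
        sum(w / z * T[last][s] for w, T in zip(weights, chains))
        for s in range(n_states)
    ]


# Toy example (hypothetical): two 2-state chains play the role of the training set S.
chains = [
    [[0.1, 0.9], [0.9, 0.1]],  # mostly alternating
    [[0.9, 0.1], [0.1, 0.9]],  # mostly repeating
]
context = [0, 1, 0, 1, 0, 1, 0]

print("1-point estimate:", unigram_predict(context, n_states=2))
print("2-point estimate:", bigram_predict(context, n_states=2))
print("memorized 2-point:", memorized_bigram_predict(context, chains))
```

On this alternating context, the 1-point estimate stays close to uniform because both states appear about equally often, while both 2-point predictors favor state 1 after state 0. The memorizing predictor can only express mixtures of the chains it was given, whereas the count-based predictors apply to any chain, which is the sense in which they generalize.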

Problem

Research questions and friction points this paper is trying to address.

in-context learning

transformers

Markov chains

generalization

memorization

Innovation

Methods, ideas, or system contributions that make the work stand out.

in-context learning

transformer mechanisms

algorithmic phases