How GPT learns layer by layer

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited capacity for constructing dynamic, generalizable world models, hindering their real-world decision-making. To address this, we employ OthelloGPT—a controlled experimental testbed—combined with sparse autoencoders (SAEs) and linear probes to systematically dissect how GPT-style models incrementally build internal world models across layers. We identify, for the first time, a hierarchical progression from static boundary perception to dynamic state modeling; further, we discover and decode a high-level semantic feature—“piece stability”—that underlies long-horizon planning. Empirically, SAEs outperform linear probes in extracting disentangled, compositional semantic representations. We precisely trace the depth-dependent emergence of both color identification and stability representations, revealing clear layer-wise developmental trajectories. This work establishes a novel interpretability paradigm for probing world-model formation within LLMs.

📝 Abstract
Large Language Models (LLMs) excel at tasks like language processing, strategy games, and reasoning but struggle to build generalizable internal representations essential for adaptive decision-making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real-world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT-based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in understanding board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, finding that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification. We use SAEs to decode features related to tile color and tile stability, a previously unexamined feature that reflects complex gameplay concepts like board control and long-term planning. We track the layer-wise progression of tile-color representations using both SAEs and linear probes, comparing their effectiveness at capturing what the model learns. Although we begin with a smaller language model, OthelloGPT, this study establishes a framework for understanding the internal representations learned by GPT models, transformers, and LLMs more broadly. Our code is publicly available: https://github.com/ALT-JS/OthelloSAE.
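The abstract contrasts two interpretability tools: linear probes, which test whether a feature is linearly decodable from frozen activations, and sparse autoencoders, which learn an overcomplete sparse dictionary over those activations. The following is a minimal toy sketch of both, not the paper's code: it uses synthetic Gaussian "activations" with one planted linear feature standing in for a tile-color direction, and all shapes, names, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for OthelloGPT residual-stream activations:
# we plant a single linear "tile color" direction in random data.
# All shapes and hyperparameters here are hypothetical.
n, d_model, d_hidden = 1000, 64, 256
acts = rng.normal(size=(n, d_model))
true_dir = rng.normal(size=d_model)
labels = (acts @ true_dir > 0).astype(float)   # e.g. "this tile is black"

# --- Linear probe: logistic regression on frozen activations ---
w = np.zeros(d_model)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w)))      # sigmoid predictions
    w -= 0.1 * acts.T @ (p - labels) / n       # gradient step on NLL
probe_acc = float((((acts @ w) > 0) == (labels > 0.5)).mean())

# --- Sparse autoencoder: overcomplete ReLU dictionary + L1 penalty ---
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_dec = W_enc.T.copy()
lam, lr = 1e-3, 0.01

def recon_mse():
    h = np.maximum(acts @ W_enc, 0.0)
    return float(np.mean((h @ W_dec - acts) ** 2))

mse_before = recon_mse()
for _ in range(200):
    h = np.maximum(acts @ W_enc, 0.0)          # sparse codes
    err = h @ W_dec - acts                     # reconstruction error
    # backprop through decoder and ReLU, plus L1 sparsity gradient
    g_h = (err @ W_dec.T + lam * np.sign(h)) * (h > 0)
    W_dec -= lr * h.T @ err / n
    W_enc -= lr * acts.T @ g_h / n
mse_after = recon_mse()
```

The design difference matters for the paper's comparison: the probe only answers "is this one labeled feature linearly readable?", while the SAE rewrites each activation as a sparse combination of learned directions without using labels at all, which is what lets it surface unanticipated features like piece stability.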
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Flexible Understanding
Complex Real-World Environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

OthelloGPT Model Analysis
Feature Recognition in Game Strategy
Enhanced Understanding of Large Language Models
Jason Du
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Kelly Hong
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Alishba Imran
UC Berkeley
Erfan Jahanparast
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Mehdi Khfifi
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Kaichun Qiao
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley