Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
In-context learning and activation steering in large language models (LLMs) appear to be distinct techniques, yet they share a single behavioral control mechanism: modulating the model's belief about latent concepts. Method: the authors propose a unified Bayesian framework that formalizes in-context learning as evidence accumulation and activation steering as adjustment of the concept prior, so that both interventions act additively in log-belief space. Contribution/Results: the framework yields, for the first time, a closed-form behavioral prediction model, uncovering intervention-induced phase transitions and critical-point bifurcations. Across diverse domains, it accurately reproduces and predicts key empirical regularities, including sigmoidal learning curves and superposition effects, demonstrating strong quantitative agreement with observed LLM behavior. The theory provides an interpretable, computationally tractable foundation for controllable LLM reasoning, bridging high-level behavioral phenomena with low-level probabilistic inference principles.
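To make the additivity claim precise, here is a minimal sketch of the Bayesian decomposition the summary describes, written for a hypothetical binary latent concept c with in-context examples x_1, ..., x_n (the notation is ours, not necessarily the paper's):

```latex
\log \frac{p(c \mid x_{1:n})}{p(\lnot c \mid x_{1:n})}
  \;=\;
  \underbrace{\log \frac{p(c)}{p(\lnot c)}}_{\text{prior: shifted by activation steering}}
  \;+\;
  \underbrace{\sum_{i=1}^{n} \log \frac{p(x_i \mid c)}{p(x_i \mid \lnot c)}}_{\text{evidence: accumulated by in-context examples}}
```

On this reading, steering adds a constant to the prior term while each in-context example contributes one likelihood-ratio term, so the two interventions are additive in log-belief space and the posterior belief is a sigmoid of the total log-odds.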

📝 Abstract
Large language models (LLMs) can be controlled at inference time through prompts (in-context learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena (e.g., sigmoidal learning curves as in-context evidence accumulates) while predicting novel ones (e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls). Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.
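As a concrete illustration of the closed-form model the abstract describes, here is a minimal, hypothetical Python sketch (all function names, parameters, and constants are our assumptions, not the paper's): the probability of the steered behavior is a sigmoid of a total log-belief that sums a base prior, a steering shift, and per-example evidence, which directly yields sigmoidal learning curves.

```python
import math

def behavior_probability(n_examples: int,
                         prior_log_odds: float = -4.0,
                         steering_shift: float = 0.0,
                         evidence_per_example: float = 0.5) -> float:
    """Closed-form belief model: P(behavior) = sigmoid of total log-odds.

    Interventions are additive in log-belief space:
      - activation steering shifts the prior term (steering_shift),
      - each in-context example adds a fixed evidence increment.
    All numeric values here are illustrative assumptions.
    """
    log_odds = prior_log_odds + steering_shift + n_examples * evidence_per_example
    return 1.0 / (1.0 + math.exp(-log_odds))

# Sigmoidal learning curve: belief rises smoothly as in-context evidence accumulates.
for n in range(0, 21, 4):
    print(f"n={n:2d}  P={behavior_probability(n):.3f}")

# Additivity: a steering shift moves the whole curve, so near the decision
# boundary a small change in steering strength can flip the predicted behavior.
print(behavior_probability(8, steering_shift=0.0))  # at the boundary (P = 0.5)
print(behavior_probability(8, steering_shift=2.0))  # well above it (P ~ 0.88)
```

Because the two interventions enter the same sum, the model predicts that shots of context and units of steering strength trade off against each other, which is the additivity-in-log-belief-space prediction highlighted in the abstract.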
Problem

Research questions and friction points this paper is trying to address.

Unifying prompt-based and activation-based LLM control methods
Explaining behavioral changes through Bayesian belief dynamics
Predicting intervention effects on model behavior shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian model unifying in-context learning and activation steering
Altering latent concept beliefs through priors and evidence
Predicting behavioral shifts via additive log-belief interventions (see the sketch below)
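To make the phase-transition prediction concrete, here is a small hypothetical calculation under the same additive log-odds model sketched above (all constants are assumed): the behavior flips where the total log-odds crosses zero, so a slight change in steering strength near that critical point produces a sudden behavioral shift.

```python
def critical_steering(n_examples: int,
                      prior_log_odds: float = -4.0,
                      evidence_per_example: float = 0.5) -> float:
    """Steering strength at which predicted behavior flips (P = 0.5).

    P = 0.5 exactly when total log-odds = 0, so solve
    prior + n * evidence + steering = 0 for the steering term.
    All numeric values are illustrative assumptions.
    """
    return -(prior_log_odds + n_examples * evidence_per_example)

# More in-context evidence lowers the steering strength needed to flip behavior.
for n in (0, 4, 8):
    print(f"n={n}: behavior flips near steering_shift = {critical_steering(n):+.1f}")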
Authors

Eric Bigelow
Goodfire AI
Daniel Wurgaft
PhD Student, Stanford University
Topics: Cognitive Science, Generalization, In-Context Learning, Reasoning, Language Models
YingQiao Wang
Department of Psychology, Harvard University
Noah Goodman
Department of Psychology, Stanford University
Tomer Ullman
Assistant Professor, Harvard University
Topics: Cognitive Science, Computational Modeling, Cognitive Development, Artificial Intelligence
Hidenori Tanaka
Physics of Intelligence Group, NTT Research
Ekdeep Singh Lubana
Goodfire AI
Topics: AI, Machine Learning, Deep Learning