🤖 AI Summary
In-context learning and activation steering in large language models (LLMs) appear heterogeneous but share a single behavioral control mechanism: modulating the model's belief about latent concepts.
Method: We propose a unified Bayesian framework that formalizes in-context learning as evidence accumulation and activation steering as prior adjustment, so that both interventions compose additively in log-belief space.
Contribution/Results: This framework yields, for the first time, a closed-form behavioral prediction model, uncovering phase-transition phenomena and critical-point bifurcations induced by interventions. Across diverse domains, it accurately reproduces and predicts key empirical regularities, including sigmoidal learning curves and superposition effects, in strong quantitative agreement with observed LLM behavior. The theory provides an interpretable, computationally tractable foundation for controllable LLM reasoning, bridging high-level behavioral phenomena with low-level probabilistic inference principles.
📝 Abstract
Large language models (LLMs) can be controlled at inference time through prompts (in-context learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena (e.g., sigmoidal learning curves as in-context evidence accumulates) while predicting novel ones (e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls). Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.
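The additive log-belief picture described above can be sketched as a toy computation. This is a minimal illustration, not the paper's actual model: all function names, parameter names, and numeric values here are invented for demonstration. Steering adds a constant shift to the concept's log-odds (a prior adjustment), each in-context example adds a fixed increment of evidence, and passing the belief through a sigmoid yields the sigmoidal learning curves and sharp phase-boundary behavior the abstract describes.

```python
import math


def belief(prior_logit, steer_shift, n_shots, evidence_per_shot):
    """Posterior belief in a latent concept under a toy additive log-belief model.

    Illustrative parameters (not from the paper):
    - prior_logit: log-odds of the concept before any intervention
    - steer_shift: additive log-odds shift from activation steering (prior change)
    - n_shots: number of in-context examples
    - evidence_per_shot: log-likelihood-ratio contributed by each example
    """
    # Both interventions compose additively in log-belief (logit) space.
    logit = prior_logit + steer_shift + n_shots * evidence_per_shot
    # Sigmoid of accumulated log-odds -> sigmoidal learning curve in n_shots.
    return 1.0 / (1.0 + math.exp(-logit))


# Evidence accumulation alone: belief traces a sigmoid as shots increase.
curve = [belief(-4.0, 0.0, n, 0.5) for n in range(17)]

# Adding a steering shift moves the whole curve left: the 0.5 "phase boundary"
# is crossed at a smaller shot count, so near the critical point a small change
# in steer_shift produces a sudden, dramatic behavioral shift.
steered = [belief(-4.0, 3.0, n, 0.5) for n in range(17)]
```

Under these toy numbers, the unsteered curve crosses belief 0.5 at 8 shots (where the accumulated log-odds reach zero), while the steered curve crosses it at 2 shots, illustrating how the two controls trade off additively.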