🤖 AI Summary
In-context learning and activation steering in large language models (LLMs) appear heterogeneous but share a single behavioral control mechanism: modulating the model's belief about latent concepts.
Method: We propose a unified Bayesian framework that formalizes in-context learning as evidence accumulation and activation steering as prior adjustment, so that both interventions compose additively in log-belief space.
Contribution/Results: This framework yields, for the first time, a closed-form behavioral prediction model, uncovering phase-transition phenomena and critical-point bifurcations induced by interventions. Across diverse domains, it accurately reproduces and predicts key empirical regularities, including sigmoidal learning curves and superposition effects, in strong quantitative agreement with observed LLM behavior. The theory provides an interpretable, computationally tractable foundation for controllable LLM reasoning, bridging high-level behavioral phenomena with low-level probabilistic inference principles.
📝 Abstract
Large language models (LLMs) can be controlled at inference time through prompts (in-context learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena (e.g., sigmoidal learning curves as in-context evidence accumulates) while predicting novel ones (e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls). Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.
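The additive log-belief picture described above can be sketched as a toy computation. This is a minimal illustration, not the paper's actual model: all function names, parameter names, and numeric values here are invented for demonstration. Steering adds a constant shift to the concept's log-odds (a prior adjustment), each in-context example adds a fixed increment of evidence, and passing the belief through a sigmoid yields the sigmoidal learning curves and sharp phase-boundary behavior the abstract describes.

```python
import math


def belief(prior_logit, steer_shift, n_shots, evidence_per_shot):
    """Posterior belief in a latent concept under a toy additive log-belief model.

    Illustrative parameters (not from the paper):
    - prior_logit: log-odds of the concept before any intervention
    - steer_shift: additive log-odds shift from activation steering (prior change)
    - n_shots: number of in-context examples
    - evidence_per_shot: log-likelihood-ratio contributed by each example
    """
    # Both interventions compose additively in log-belief (logit) space.
    logit = prior_logit + steer_shift + n_shots * evidence_per_shot
    # Sigmoid of accumulated log-odds -> sigmoidal learning curve in n_shots.
    return 1.0 / (1.0 + math.exp(-logit))


# Evidence accumulation alone: belief traces a sigmoid as shots increase.
curve = [belief(-4.0, 0.0, n, 0.5) for n in range(17)]

# Adding a steering shift moves the whole curve left: the 0.5 "phase boundary"
# is crossed at a smaller shot count, so near the critical point a small change
# in steer_shift produces a sudden, dramatic behavioral shift.
steered = [belief(-4.0, 3.0, n, 0.5) for n in range(17)]
```

Under these toy numbers, the unsteered curve crosses belief 0.5 at 8 shots (where the accumulated log-odds reach zero), while the steered curve crosses it at 2 shots, illustrating how the two controls trade off additively.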