🤖 AI Summary
How do Transformers implicitly adapt to task complexity during in-context learning (ICL), without parameter updates?
Method: We construct controllable, multi-level task environments grounded in Markov chains and linear regression, pair them with a Bayesian theoretical framework, and run ablations on model size, training mixture distribution, inference context length, and architecture.
Contribution/Results: We provide empirical evidence that, when in-context examples are compatible with multiple hypotheses, Transformers select the simplest sufficient explanation, implicitly applying a Bayesian Occam's razor that balances goodness-of-fit against a complexity penalty. This inductive bias operates entirely within the ICL paradigm, requiring no fine-tuning or gradient updates. A case study on a pretrained GPT-4 model with Boolean-function tasks shows the same behavior: the model infers the true task complexity, estimates the underlying parameters, and converges to the minimal-complexity hypothesis under ambiguity. These findings suggest that complexity-aware generalization may be inherent to Transformers trained on diverse task distributions.
📝 Abstract
In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. While existing research has typically studied ICL in fixed-complexity environments, practical language models encounter tasks spanning diverse complexity levels. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones. We design well-controlled testbeds based on Markov chains and linear regression that reveal transformers not only identify the appropriate complexity level for each task but also accurately infer the corresponding parameters, even when the in-context examples are compatible with multiple complexity hypotheses. Notably, when presented with data generated by simpler processes, transformers consistently favor the least complex sufficient explanation. We theoretically explain this behavior through a Bayesian framework, demonstrating that transformers effectively implement an in-context Bayesian Occam's razor by balancing model fit against complexity penalties. We further ablate the roles of model size, training mixture distribution, inference context length, and architecture. Finally, we validate this Occam's razor-like inductive bias on a pretrained GPT-4 model with Boolean-function tasks as a case study, suggesting it may be inherent to transformers trained on diverse task distributions.
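The Bayesian Occam's razor the abstract refers to can be illustrated with a minimal sketch (an illustration under assumed uniform priors, not the paper's actual setup): compare the marginal likelihood of an order-0 (i.i.d.) hypothesis against an order-1 Markov hypothesis for a binary sequence. The order-1 model can represent anything the order-0 model generates, yet marginalizing over its extra free parameter yields an automatic complexity penalty, so data consistent with the simpler process favors the simpler hypothesis.

```python
import math

def beta_binom_log_ml(k, n):
    # Log marginal likelihood of k ones in n Bernoulli draws under a
    # uniform Beta(1,1) prior: integral of p^k (1-p)^(n-k) dp = B(k+1, n-k+1).
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

def log_ml_order0(seq):
    # Order-0 hypothesis: symbols are i.i.d. Bernoulli(p), p ~ Uniform(0, 1).
    return beta_binom_log_ml(sum(seq), len(seq))

def log_ml_order1(seq):
    # Order-1 hypothesis: one transition probability per previous symbol,
    # each with its own uniform prior (first symbol ignored for simplicity).
    counts = {0: [0, 0], 1: [0, 0]}  # counts[prev][next]
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1
    return sum(beta_binom_log_ml(c[1], c[0] + c[1]) for c in counts.values())

# Data whose transition statistics carry no order-1 signal: both models
# fit it, but the simpler hypothesis gets the higher marginal likelihood.
seq = [0, 0, 1, 1] * 50
print(log_ml_order0(seq) > log_ml_order1(seq))  # → True
```

Under this toy prior, the complexity penalty arises purely from integrating over the extra free parameter; no explicit regularizer is needed, mirroring the fit-versus-complexity trade-off the paper attributes to in-context learning.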