🤖 AI Summary
This work investigates the internal mechanisms that enable task-level generalization in large language models (LLMs) through in-context learning, focusing on a counterfactual arithmetic task ("off-by-one addition", e.g., 1+1=3). Using circuit-style interpretability techniques, particularly path patching, the authors systematically identify and dissect the computational structures underpinning task transfer. They find that LLMs do not merely memorize input-output patterns; instead, they activate a reusable "function induction" mechanism that elevates low-level induction heads to a higher level of functional abstraction. Multiple attention heads coordinate in parallel to form an adaptive computational circuit, each emitting a distinct piece of the +1 function. This mechanism generalizes across diverse synthetic and algorithmic tasks and is reused across them. Crucially, the study grounds its mechanistic explanation of task-level generalization in fine-grained causal interventions on circuits rather than in behavioral observation alone.
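To make the task concrete, here is a minimal sketch (with a hypothetical helper name, not code from the paper) of how an in-context off-by-one addition prompt can be constructed: each demonstration shows a sum shifted by +1, and the model must infer the hidden +1 step for the final query.

```python
def off_by_one_prompt(query_a: int, query_b: int, n_shots: int = 3) -> str:
    """Few-shot prompt where each demonstration shows a+b = (a+b)+1."""
    lines = []
    for i in range(1, n_shots + 1):
        a, b = i, i  # e.g. 1+1=3, 2+2=5, 3+3=7
        lines.append(f"{a}+{b}={a + b + 1}")
    lines.append(f"{query_a}+{query_b}=")  # model should continue with (a+b)+1
    return "\n".join(lines)

print(off_by_one_prompt(4, 4))
# 1+1=3
# 2+2=5
# 3+3=7
# 4+4=
```

A model that only memorized standard addition would answer 8 here; completing with 9 is the behavior the paper traces back to the function induction mechanism.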
📝 Abstract
Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their notable performance and present three key findings. First, we uncover a function induction mechanism that explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
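Path patching, the core technique named in the abstract, can be illustrated with a toy sketch (this is a simplified stand-in, not the paper's implementation): cache a component's activation from a run on a counterfactual input, splice it into a run on the clean input, and read off the change in the output as that component's causal contribution.

```python
# Toy path-patching sketch. The "model" is a stub whose output is the
# sum of three per-"head" contributions; head functions are invented
# for illustration only.
def head_outputs(x):
    """Deterministic per-head contributions for input x."""
    return [0.5 * x, x - 1.0, x * x]

def model(x, patch=None):
    """Forward pass; `patch=(idx, cached)` overwrites one head's output."""
    heads = head_outputs(x)
    if patch is not None:
        idx, cached = patch
        heads[idx] = cached  # splice in the cached activation
    return sum(heads)

x_clean, x_cf = 2.0, -1.0
cf_heads = head_outputs(x_cf)            # cache activations on the counterfactual run
baseline = model(x_clean)                # clean forward pass: 1.0 + 1.0 + 4.0 = 6.0
patched = model(x_clean, patch=(1, cf_heads[1]))  # patch only head 1's path

print(patched - baseline)                # causal effect of head 1: -3.0
```

In the paper's setting, the patched component is an attention head in a transformer and the measured effect is the change in the answer logits, but the logic is the same: a large effect from patching a single head's path is evidence that the head carries a distinct piece of the computation, such as one part of the +1 function.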