🤖 AI Summary
This work investigates token-prediction mechanisms in two-layer Transformers under long-context conditions, focusing on the interplay between pretrained bigram knowledge and contextual cues. Methodologically, it combines formal theoretical analysis with controlled experiments: using a two-layer Transformer and prompts generated by a pretrained bigram language model, it models induction-head behavior and derives a closed-form expression for the logits. The resulting logits decomposition shows, formally for the first time, how induction heads "hijack" pretrained bigram preferences via associative memory, and systematically characterizes when context overrides or suppresses those preferences. Both the theory and the experiments confirm that context exerts directional control over the activation of pretrained knowledge, establishing knowledge hijacking as a core mechanism of in-context learning (ICL) and yielding a verifiable, interpretable micro-mechanistic account of ICL.
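The toy sketch below illustrates the kind of logits decomposition described above; it is not the paper's actual model. It assumes a small vocabulary, a random matrix `W_bigram` standing in for pretrained bigram knowledge stored as an associative memory, and a hand-written `induction_head_logits` function that mimics an induction head copying from context. The names and the mixing weights `alpha` and `beta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size (hypothetical)

# Pretrained "global" bigram knowledge as an associative memory:
# row i holds logit preferences for the token that follows token i.
W_bigram = rng.normal(size=(V, V))

def induction_head_logits(prompt):
    """In-context (induction-head) logit contribution: find earlier
    occurrences of the current token and vote for whatever followed it."""
    logits = np.zeros(V)
    last = prompt[-1]
    for t in range(len(prompt) - 1):
        if prompt[t] == last:              # match with an earlier position
            logits[prompt[t + 1]] += 1.0   # copy the token that followed it
    return logits

def combined_logits(prompt, alpha=1.0, beta=1.0):
    """Toy decomposition: logits = beta * bigram term + alpha * context term."""
    bigram_term = W_bigram[prompt[-1]]
    context_term = induction_head_logits(prompt)
    return beta * bigram_term + alpha * context_term

# If the context repeatedly pairs token 3 with token 5, the induction term
# can dominate ("hijack") the pretrained bigram preference after token 3.
prompt = [3, 5, 1, 3, 5, 2, 3]
print(combined_logits(prompt).argmax())
```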
📝 Abstract
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without fine-tuning by leveraging contextual information provided within a prompt. However, ICL relies not only on contextual clues but also on the global knowledge acquired during pretraining for next-token prediction. Analyzing this process has been challenging due to the complex computational circuitry of LLMs. This paper investigates the balance between in-context information and pretrained bigram knowledge in token prediction, focusing on the induction head mechanism, a key component of ICL. Leveraging the fact that a two-layer transformer can implement the induction head mechanism with associative memories, we theoretically analyze the logits when a two-layer transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of a two-layer transformer align with the theoretical results.
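As a rough illustration of this experimental setup (a sketch under assumptions, not the paper's actual protocol), the snippet below samples prompts from a toy bigram model, plants an in-context pair that conflicts with the bigram model's preferred continuation, and counts how often a given predictor follows the context versus the pretrained bigram statistics. The transition matrix `P`, the helper names, and the `predict_next` callable (standing in for a trained two-layer transformer) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size (hypothetical)

# Toy bigram model used to generate prompts: P[i, j] = Pr(next = j | current = i).
P = rng.dirichlet(np.ones(V), size=V)

def sample_bigram_prompt(length, start=0):
    """Sample a prompt from the bigram model."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(int(rng.choice(V, p=P[seq[-1]])))
    return seq

def plant_context_pair(prompt, trigger, target):
    """Insert an in-context pair (trigger -> target) that may conflict with
    the bigram model's preferred continuation, then end on the trigger."""
    return prompt + [trigger, target] + prompt[:3] + [trigger]

def context_vs_bigram(predict_next, n_trials=200, length=16):
    """Fraction of trials where the prediction follows the planted context pair
    versus the bigram model's most likely continuation."""
    context_wins = bigram_wins = 0
    for _ in range(n_trials):
        prompt = sample_bigram_prompt(length)
        trigger = int(rng.integers(V))
        bigram_choice = int(P[trigger].argmax())
        target = (bigram_choice + 1) % V          # deliberately conflicts with the bigram model
        pred = predict_next(plant_context_pair(prompt, trigger, target))
        context_wins += int(pred == target)
        bigram_wins += int(pred == bigram_choice)
    return context_wins / n_trials, bigram_wins / n_trials

# Example: a predictor that only uses the bigram statistics never follows the context pair.
print(context_vs_bigram(lambda p: int(P[p[-1]].argmax())))
```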