🤖 AI Summary
This work investigates token-prediction mechanisms in two-layer Transformers under long-context conditions, focusing on the interplay between pretrained bigram knowledge and contextual cues. Methodologically, it combines formal theoretical analysis with controlled experiments: using a two-layer Transformer and prompts generated by a pretrained bigram language model, it models induction-head behavior and derives a closed-form expression for the logits. The resulting logits decomposition shows, formally for the first time, how induction heads "hijack" pretrained bigram preferences via associative memory, and systematically characterizes when context overrides or suppresses those preferences. Both the theory and the experiments confirm that context exerts directional control over the activation of pretrained knowledge, establishing knowledge hijacking as a core mechanism of in-context learning (ICL) and yielding a verifiable, interpretable micro-mechanistic account of ICL.
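The toy sketch below illustrates the kind of logits decomposition described above; it is not the paper's actual model. It assumes a small vocabulary, a random matrix `W_bigram` standing in for pretrained bigram knowledge stored as an associative memory, and a hand-written `induction_head_logits` function that mimics an induction head copying from context. The names and the mixing weights `alpha` and `beta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size (hypothetical)

# Pretrained "global" bigram knowledge as an associative memory:
# row i holds logit preferences for the token that follows token i.
W_bigram = rng.normal(size=(V, V))

def induction_head_logits(prompt):
    """In-context (induction-head) logit contribution: find earlier
    occurrences of the current token and vote for whatever followed it."""
    logits = np.zeros(V)
    last = prompt[-1]
    for t in range(len(prompt) - 1):
        if prompt[t] == last:              # match with an earlier position
            logits[prompt[t + 1]] += 1.0   # copy the token that followed it
    return logits

def combined_logits(prompt, alpha=1.0, beta=1.0):
    """Toy decomposition: logits = beta * bigram term + alpha * context term."""
    bigram_term = W_bigram[prompt[-1]]
    context_term = induction_head_logits(prompt)
    return beta * bigram_term + alpha * context_term

# If the context repeatedly pairs token 3 with token 5, the induction term
# can dominate ("hijack") the pretrained bigram preference after token 3.
prompt = [3, 5, 1, 3, 5, 2, 3]
print(combined_logits(prompt).argmax())
```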
📝 Abstract
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without fine-tuning by leveraging contextual information provided within a prompt. However, ICL relies not only on contextual clues but also on the global knowledge acquired during pretraining for next-token prediction. Analyzing this process has been challenging due to the complex computational circuitry of LLMs. This paper investigates the balance between in-context information and pretrained bigram knowledge in token prediction, focusing on the induction head mechanism, a key component of ICL. Leveraging the fact that a two-layer transformer can implement the induction head mechanism with associative memories, we theoretically analyze the logits when a two-layer transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of a two-layer transformer align with the theoretical results.
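As a rough illustration of this experimental setup (a sketch under assumptions, not the paper's actual protocol), the snippet below samples prompts from a toy bigram model, plants an in-context pair that conflicts with the bigram model's preferred continuation, and counts how often a given predictor follows the context versus the pretrained bigram statistics. The transition matrix `P`, the helper names, and the `predict_next` callable (standing in for a trained two-layer transformer) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size (hypothetical)

# Toy bigram model used to generate prompts: P[i, j] = Pr(next = j | current = i).
P = rng.dirichlet(np.ones(V), size=V)

def sample_bigram_prompt(length, start=0):
    """Sample a prompt from the bigram model."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(int(rng.choice(V, p=P[seq[-1]])))
    return seq

def plant_context_pair(prompt, trigger, target):
    """Insert an in-context pair (trigger -> target) that may conflict with
    the bigram model's preferred continuation, then end on the trigger."""
    return prompt + [trigger, target] + prompt[:3] + [trigger]

def context_vs_bigram(predict_next, n_trials=200, length=16):
    """Fraction of trials where the prediction follows the planted context pair
    versus the bigram model's most likely continuation."""
    context_wins = bigram_wins = 0
    for _ in range(n_trials):
        prompt = sample_bigram_prompt(length)
        trigger = int(rng.integers(V))
        bigram_choice = int(P[trigger].argmax())
        target = (bigram_choice + 1) % V          # deliberately conflicts with the bigram model
        pred = predict_next(plant_context_pair(prompt, trigger, target))
        context_wins += int(pred == target)
        bigram_wins += int(pred == bigram_choice)
    return context_wins / n_trials, bigram_wins / n_trials

# Example: a predictor that only uses the bigram statistics never follows the context pair.
print(context_vs_bigram(lambda p: int(P[p[-1]].argmax())))
```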