Understanding Generalization and Forgetting in In-Context Continual Learning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing theory has largely studied in-context learning on single tasks, so it remains unclear whether large language models implicitly perform continual learning when a single prompt contains several tasks in sequence, and in particular how generalization and forgetting arise. The paper proposes what it describes as the first theoretical framework for in-context continual learning, analyzing how a pretrained Transformer processes task sequences through shared attention. Focusing on linear and masked linear self-attention, it derives expressions for prediction error and decomposes that error into bias, variance, and interference terms. The analysis shows that standard attention inevitably induces inter-task interference because it aggregates historical context uniformly or causally, characterizes when earlier context yields positive or negative transfer, and exposes fundamental limits of attention-based continual inference: order sensitivity and performance degradation in long prompts.
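
To make the interference claim concrete, here is a toy calculation. It is an illustrative assumption, not the paper's theorem or its exact model. Suppose the prompt holds n_1 noiseless examples from task 1 (y = w_1^T x) followed by n_2 examples from task 2 (y = w_2^T x), with inputs x drawn from a standard Gaussian in d dimensions. A simplified linear-attention predictor that averages y_i x_i uniformly over the whole context then has an expected estimate that mixes both tasks:

```latex
% Toy setting, assumed for illustration only (not taken from the paper).
\begin{align*}
\hat{w} &= \frac{1}{n}\sum_{i=1}^{n} y_i x_i, \qquad n = n_1 + n_2,\\
\mathbb{E}[\hat{w}] &= \frac{n_1}{n}\, w_1 + \frac{n_2}{n}\, w_2
  \quad \text{since } \mathbb{E}[x x^\top] = I_d,\\
\mathbb{E}[\hat{w}] - w_2 &= \frac{n_1}{n}\,(w_1 - w_2).
\end{align*}
```

Under these assumptions, the bias on a task-2 query vanishes only when w_1 = w_2 and grows with the share n_1/n of older context, which is one simple way to read the summary's point that uniform aggregation produces systematic inter-task bias.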

📝 Abstract

In-context learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequential tasks within a single prompt through shared attention mechanisms. Focusing on linear and masked linear self-attention, we derive error expressions for model predictions under sequential task prompts and analyze their generalization and forgetting behavior. Our results reveal that standard attention mechanisms inevitably induce intertask interference by uniformly or causally aggregating historical contexts, leading to systematic bias. We further provide a bias-variance-interference decomposition of prediction error, characterizing when historical in-context information yields positive transfer or provable negative transfer. This analysis exposes fundamental limits of attention-based continual inference and offers theoretical explanations for order sensitivity and performance degradation in long prompts.
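
The interference effect described above can be checked numerically with a minimal sketch. Everything in it is an assumption made for illustration: the two-task noiseless linear regression setup, the simplified uniform linear-attention estimate `w_hat = (1/n) * sum_i y_i * x_i`, and all function names are hypothetical and are not the paper's model or code.

```python
import random


def make_task_examples(w, n, rng):
    """Draw n noiseless pairs (x, y) with x ~ N(0, I) and y = <w, x>."""
    examples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w]
        y = sum(wi * xi for wi, xi in zip(w, x))
        examples.append((x, y))
    return examples


def linear_attention_estimate(context):
    """Uniformly aggregate the whole context: w_hat = (1/n) * sum_i y_i * x_i.

    This is a simplified stand-in for a linear self-attention readout
    (an assumption of this sketch), and it ignores task boundaries.
    """
    n = len(context)
    d = len(context[0][0])
    w_hat = [0.0] * d
    for x, y in context:
        for j in range(d):
            w_hat[j] += y * x[j] / n
    return w_hat


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_estimate(w1, w2, n1, n2, trials, seed=0):
    """Average w_hat over many prompts: n1 task-1 examples, then n2 task-2 examples."""
    rng = random.Random(seed)
    avg = [0.0] * len(w1)
    for _ in range(trials):
        context = make_task_examples(w1, n1, rng) + make_task_examples(w2, n2, rng)
        w_hat = linear_attention_estimate(context)
        for j in range(len(avg)):
            avg[j] += w_hat[j] / trials
    return avg


if __name__ == "__main__":
    w1, w2 = [1.0, 0.0], [0.0, 1.0]
    mixed = mean_estimate(w1, w2, n1=20, n2=20, trials=2000)
    pure = mean_estimate(w1, w2, n1=0, n2=40, trials=2000)
    # Squared distance of the averaged estimate from the current task's weights.
    print("mixed prompt:", squared_distance(mixed, w2))
    print("task-2 only: ", squared_distance(pure, w2))
```

In this toy setup the averaged estimate for the mixed prompt sits near the midpoint of the two task vectors, while the task-2-only prompt recovers `w2` closely. Because the uniform average ignores position, this sketch cannot show order sensitivity; that would need a causal (masked) variant that aggregates only the prefix at each position.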

Problem

Research questions and friction points this paper is trying to address.

in-context learning

continual learning

generalization

catastrophic forgetting

attention mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

in-context continual learning

attention mechanism

task interference