🤖 AI Summary
This work addresses the degradation of in-context learning (ICL) in large language models (LLMs) under full-parameter fine-tuning, which often undermines few-shot generalization. Using a linear attention framework, the authors theoretically characterize how standard fine-tuning disrupts the mechanisms underlying ICL. To mitigate this, they propose a constrained fine-tuning strategy that updates only the value matrices, improving zero-shot performance on the target task while preserving few-shot learning ability. Further analysis shows that adding an auxiliary few-shot loss strengthens in-context learning on the target task, though at the cost of in-context performance on tasks unseen during fine-tuning. Both theoretical and empirical results support restricting the set of tunable parameters as a principled and effective way to jointly optimize zero-shot and in-context learning performance.
📝 Abstract
Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
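The core idea of value-only fine-tuning can be illustrated with a toy sketch. Below is a minimal NumPy example of a single linear (softmax-free) attention layer, where the query and key matrices are frozen, so the attention pattern is fixed, and only the value matrix is updated by gradient descent on a squared-error objective. All parameter names, sizes, and the regression target here are illustrative assumptions, not the paper's actual setup or data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6  # model dimension and sequence length (illustrative sizes)

# Hypothetical attention parameters for the sketch.
W_Q = rng.standard_normal((d, d)) / np.sqrt(d)
W_K = rng.standard_normal((d, d)) / np.sqrt(d)
W_V = rng.standard_normal((d, d)) / np.sqrt(d)

def linear_attention(X, W_Q, W_K, W_V):
    """Linear attention: raw dot-product scores, no softmax."""
    A = (X @ W_Q) @ (X @ W_K).T  # (n, n) attention scores
    return A @ (X @ W_V)         # (n, d) outputs

# Toy regression target standing in for a fine-tuning objective.
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d))

# Value-only fine-tuning: W_Q and W_K stay frozen, so the attention
# pattern A is fixed; the output is linear in W_V.
A = (X @ W_Q) @ (X @ W_K).T
M = A @ X  # effective input to the value pathway
# Step size chosen from the spectral norm so plain gradient descent
# on this quadratic loss decreases monotonically.
lr = 0.5 / (np.linalg.norm(M, 2) ** 2)

losses = []
for _ in range(200):
    err = M @ W_V - Y
    losses.append(float((err ** 2).sum()))
    # Closed-form gradient of the squared loss w.r.t. W_V only.
    W_V -= lr * (2.0 * M.T @ err)

print(losses[0] > losses[-1])  # True: loss decreases while W_Q, W_K are untouched
```

Because the output is linear in W_V when the query and key matrices are frozen, this restricted fine-tuning reduces to a well-conditioned least-squares problem, while the fixed attention pattern is what (in the paper's analysis) preserves the in-context learning mechanism.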