Learning to Adapt: In-Context Learning Beyond Stationarity

📅 2026-04-12

🤖 AI Summary
This work addresses a limitation of conventional analyses of in-context learning: they assume a stationary task distribution, which fails to capture target functions that vary over time. Focusing on non-stationary regression tasks, the study investigates the in-context learning capabilities of Transformers and presents a theoretical analysis for this setting. By analyzing Gated Linear Attention (GLA) and modeling non-stationarity via a first-order autoregressive process, the authors show that GLA adaptively assigns higher weights to recent observations, thereby inducing a learnable recency bias. The theoretical results establish that GLA achieves lower training and test errors than standard linear attention. Empirical experiments corroborate the benefit of gating in non-stationary in-context learning.
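
To make the recency-bias claim concrete, the setting can be written down in a few lines. This is a sketch under assumptions, not the paper's exact formulation: the symbols $\rho$, $\xi_t$, $\lambda$, and $S_t$ are illustrative names, the gate is taken to be a single constant scalar, and the noise scaling $\sqrt{1-\rho^2}$ is one common normalization for a first-order autoregressive process.

```latex
% Assumed AR(1) drift of the regression weights, with |\rho| < 1:
w_{t+1} = \rho\, w_t + \sqrt{1-\rho^{2}}\;\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I_d), \qquad y_t = w_t^{\top} x_t .

% Linear-attention-style state with a constant scalar gate \lambda \in (0, 1]:
S_t = \lambda\, S_{t-1} + y_t\, x_t, \qquad S_0 = 0 .

% Unrolling the recurrence gives exponentially decaying weights on older examples:
S_n = \sum_{i=1}^{n} \lambda^{\,n-i}\, y_i\, x_i, \qquad \hat{y}_{\text{query}} = S_n^{\top} x_{\text{query}} .
```

Setting $\lambda = 1$ removes the decay and weights all in-context examples equally, which corresponds to ungated linear attention in this simplified picture; $\lambda < 1$ discounts older examples, which is the recency bias described above.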

📝 Abstract
Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, an assumption that overlooks a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL for non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and rigorously characterize its advantages over standard linear attention in this dynamic setting. To model non-stationarity, we adopt a first-order autoregressive process and show that GLA achieves lower training and test errors by adaptively modulating the influence of past inputs, effectively implementing a learnable recency bias. Our theoretical findings are further supported by empirical results, which validate the benefits of gating mechanisms in non-stationary ICL tasks.
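
The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model or experiment: it assumes Gaussian inputs, a noiseless linear target whose weights follow an AR(1) process, and a hand-set constant scalar gate in place of learned attention parameters. All function names (`ar1_regression_prompt`, `recency_weights`, `linear_attention_predict`, `compare_mse`) are hypothetical and exist only for this example.

```python
import numpy as np


def ar1_regression_prompt(n, d, rho, rng):
    """Sample n context pairs plus one query from a drifting linear task."""
    w = rng.standard_normal(d)
    xs = rng.standard_normal((n + 1, d))
    ys = np.empty(n + 1)
    for t in range(n + 1):
        ys[t] = xs[t] @ w
        # Assumed AR(1) drift; the sqrt factor keeps the variance of w at 1.
        w = rho * w + np.sqrt(1.0 - rho**2) * rng.standard_normal(d)
    return xs, ys


def recency_weights(n, gate):
    """Weights gate**(n-1-i) for i = 0..n-1; gate=1.0 gives uniform weights."""
    return gate ** np.arange(n - 1, -1, -1)


def linear_attention_predict(xs, ys, gate):
    """Predict the query label from a (normalized) gated sum of y_i * x_i."""
    ctx_x, ctx_y, query = xs[:-1], ys[:-1], xs[-1]
    a = recency_weights(len(ctx_y), gate)
    # Normalizing by a.sum() is a simplification chosen for this sketch.
    w_hat = (a * ctx_y) @ ctx_x / a.sum()
    return query @ w_hat


def compare_mse(n, d, rho, gate, trials, seed=0):
    """Monte Carlo squared error of ungated (gate=1) vs. gated prediction."""
    rng = np.random.default_rng(seed)
    err_uniform, err_gated = 0.0, 0.0
    for _ in range(trials):
        xs, ys = ar1_regression_prompt(n, d, rho, rng)
        err_uniform += (linear_attention_predict(xs, ys, 1.0) - ys[-1]) ** 2
        err_gated += (linear_attention_predict(xs, ys, gate) - ys[-1]) ** 2
    return err_uniform / trials, err_gated / trials


if __name__ == "__main__":
    mse_uniform, mse_gated = compare_mse(n=40, d=2, rho=0.9, gate=0.9, trials=4000)
    print(f"ungated MSE: {mse_uniform:.3f}  gated MSE: {mse_gated:.3f}")
```

With drifting weights, the uniform average mixes in stale examples, so discounting them should lower the query error in this toy setup. With `rho=1.0` the task is stationary and the advantage of a gate below 1 is expected to disappear, since every example is then equally informative.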
Problem

Research questions and friction points this paper is trying to address.

in-context learning
non-stationarity
transformer
regression
time-varying
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-context learning
non-stationarity
gated linear attention
recency bias
transformer adaptation