When and How Unlabeled Data Provably Improve In-Context Learning

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how unlabeled data can be used effectively in in-context learning (ICL), particularly in semi-supervised settings where demonstration labels are missing or erroneous. Methodologically, it uses a binary Gaussian mixture model to theoretically analyze how linear attention and multi-layer/recurrent Transformers behave differently on unlabeled data. It establishes, for the first time, that multi-layer/recurrent Transformers implicitly construct polynomial estimators of the form ∑aᵢ(XᵀX)ⁱXᵀy through depth, enabling unlabeled-data-driven performance gains, and formally links this mechanism to the EM algorithm, showing that the dominant polynomial degree grows exponentially with network depth. Finally, it proposes a pseudo-labeling strategy based on iteratively invoking pretrained tabular foundation models, achieving significant improvements over single-pass inference on real-world datasets.
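The polynomial estimator ∑aᵢ(XᵀX)ⁱXᵀy in the summary can be evaluated with a simple loop over the powers, mirroring how depth builds up the polynomial. A minimal NumPy sketch on a toy binary GMM with missing labels zeroed out; the coefficients `coeffs` and the toy data are illustrative, not the paper's learned values:

```python
import numpy as np

def polynomial_estimator(X, y, coeffs):
    """Compute sum_i a_i (X^T X)^i X^T y by iterating the power.

    X: (n, d) features; y: (n,) partially observed labels with
    missing entries set to zero; coeffs: illustrative weights a_i.
    """
    XtX = X.T @ X
    term = X.T @ y                 # (X^T X)^0 X^T y
    w = np.zeros_like(term)
    for a in coeffs:
        w += a * term              # accumulate a_i (X^T X)^i X^T y
        term = XtX @ term          # raise the power by one
    return w

# toy binary GMM; half the demonstration labels are missing (set to zero)
rng = np.random.default_rng(0)
n, d = 200, 5
mu = np.ones(d)                    # illustrative class mean
labels = rng.choice([-1.0, 1.0], size=n)
X = labels[:, None] * mu + rng.normal(size=(n, d))
y = labels.copy()
y[rng.random(n) < 0.5] = 0.0       # missing labels zeroed, as in the paper

w = polynomial_estimator(X, y, coeffs=[1.0, 1e-3, 1e-6])
preds = np.sign(X @ w)
```

Each extra coefficient corresponds to one more matrix power, which is what additional depth (or looping) supplies implicitly.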

📝 Abstract
Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) the loss landscape of one-layer linear attention models recovers the optimal fully-supervised estimator but completely fails to exploit unlabeled data; (2) in contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^i X^\top y$, with $X$ and $y$ denoting features and partially observed labels (missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so a mild amount of depth/looping suffices. As an application of the theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves semi-supervised tabular learning performance over standard single-pass inference.
Problem

Research questions and friction points this paper is trying to address.

Analyzes in-context learning with missing or incorrect labels
Compares performance of single-layer vs multilayer transformers
Proposes looping transformers to enhance semi-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilayer transformers leverage unlabeled data effectively
Looped transformers enhance semi-supervision via polynomial estimators
Polynomial degree grows exponentially with depth, so mild depth boosts semi-supervised tabular learning performance
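The looping idea above amounts to EM-style iterative pseudo-labeling: repeatedly re-label the unlabeled points with the current model and refit. A hedged NumPy sketch on the same binary-GMM setting; the mean-direction classifier and the soft `tanh` relabeling are illustrative stand-ins for looping a pretrained tabular foundation model, not the paper's exact procedure:

```python
import numpy as np

def pseudo_label_loop(X, y_obs, mask, n_rounds=5):
    """EM-style iterative pseudo-labeling for a binary GMM.

    X: (n, d) features; y_obs: labels with missing entries set to zero;
    mask: True where the label is observed. Each round estimates a mean
    direction from current labels (M-step analogue), then softly
    relabels the unlabeled points (E-step analogue).
    """
    y = y_obs.astype(float).copy()
    for _ in range(n_rounds):
        mu_hat = X.T @ y / max(np.abs(y).sum(), 1e-12)  # class-mean direction
        y[~mask] = np.tanh(X[~mask] @ mu_hat)           # soft pseudo-labels
        y[mask] = y_obs[mask]                           # observed labels stay fixed
    return np.sign(X @ mu_hat), mu_hat

# toy binary GMM: only 20% of demonstrations carry labels
rng = np.random.default_rng(1)
n, d = 300, 8
mu = np.ones(d)                      # illustrative class mean
labels = rng.choice([-1.0, 1.0], size=n)
X = labels[:, None] * mu + rng.normal(size=(n, d))
mask = rng.random(n) < 0.2
y_obs = np.where(mask, labels, 0.0)

preds, mu_hat = pseudo_label_loop(X, y_obs, mask)
```

Each loop iteration plays the role of one block of depth: the relabeled points sharpen the next mean estimate, just as each transformer layer raises the effective polynomial degree.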