🤖 AI Summary
Existing in-context learning (ICL) methods predominantly rely on semantic similarity to retrieve top-K exemplars, yet this often yields label-inconsistent demonstrations that impair generalization. We identify this issue as an implicit transductive label propagation problem and, for the first time, formulate ICL from a Bayesian perspective—jointly modeling concept-guided retrieval and label estimation under an error-bounded label propagation framework. Based on this formulation, we propose TopK-SD, a label-consistency-driven sampling method that jointly optimizes semantic similarity and label distribution modeling via synthetic data augmentation. Evaluated across multiple NLP benchmarks, TopK-SD consistently outperforms standard top-K retrieval, empirically validating the critical role of label consistency in ICL performance. Our work establishes a novel analytical paradigm for understanding the intrinsic mechanisms of ICL, bridging conceptual grounding with reliable label inference.
📝 Abstract
Large language models (LLMs) perform in-context learning (ICL) from only a handful of supervised examples, benefiting a wide range of natural language processing (NLP) tasks. A critical research focus is the selection of prompt demonstrations. Current approaches typically employ retrieval models to select the top-K most semantically similar examples as demonstrations. However, we argue that existing methods are limited because label consistency is not guaranteed during demonstration selection. This insight derives from a Bayesian view of ICL and from rethinking ICL as transductive label propagation. Treating ICL as a transductive learning method and incorporating latent concepts from the Bayesian view, we deduce that similar demonstrations guide the latent concept of the query, with their consistent labels serving as label estimates. Based on this understanding, we establish a label propagation framework that links label consistency with propagation error bounds. To model label consistency, we propose a data synthesis method that leverages both semantic and label information, and use TopK sampling with Synthetic Data (TopK-SD) to acquire demonstrations with consistent labels. TopK-SD outperforms the original TopK sampling on multiple benchmarks. Our work provides a new perspective for understanding the working mechanisms of ICL.
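To make the selection idea concrete, here is a minimal, hypothetical sketch of label-consistency-aware top-K retrieval: rank candidates by cosine similarity, take a small similarity pool, estimate the query's label as the pool's majority label, and keep only demonstrations agreeing with that estimate. The function name, the pool-then-filter heuristic, and all parameters are illustrative assumptions, not the paper's actual TopK-SD algorithm (which additionally relies on synthetic data).

```python
import numpy as np

def topk_label_consistent(query_emb, cand_embs, cand_labels, k=4, pool=8):
    """Hypothetical sketch of label-consistency-aware demonstration selection.

    Retrieves a pool of semantically similar candidates, estimates the query
    label as the pool's majority label, and keeps the k pool members whose
    labels agree with that estimate. Not the paper's exact TopK-SD procedure.
    """
    # Cosine similarity between the query and every candidate embedding.
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12
    )
    # Top-`pool` candidates by semantic similarity (plain top-K retrieval).
    pool_idx = np.argsort(-sims)[:pool]
    pool_labels = [cand_labels[i] for i in pool_idx]
    # Majority label over the pool serves as the label estimate for the query.
    majority = max(set(pool_labels), key=pool_labels.count)
    # Keep only label-consistent demonstrations, preserving similarity order.
    consistent = [i for i in pool_idx if cand_labels[i] == majority]
    return consistent[:k], majority
```

A toy call with 2-D embeddings: for a query `[1, 0]` against candidates `[[1, 0], [0.9, 0.1], [0.8, 0.2], [0, 1]]` labeled `['A', 'A', 'B', 'B']`, the top-3 similarity pool is candidates 0, 1, 2, the majority label is `'A'`, and the label-inconsistent candidate 2 is dropped from the final demonstration set.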