🤖 AI Summary
Prior work on in-context learning (ICL) predominantly relies on human-annotated demonstrations, overlooking the potential of reliable self-generated labels. This paper introduces the first systematic semi-supervised ICL framework, comprising three steps: (1) LLM-driven annotation generation (self-labeling), (2) confidence-guided selection of high-quality pseudo-demonstrations, and (3) semi-supervised inference. The baseline method, Naive-SemiICL, outperforms a 16-shot supervised baseline by an average of 9.94% across 16 datasets. IterPSD, an annotation approach that iteratively refines pseudo-demonstrations, yields up to 6.8% additional gains on classification tasks. The paper also reveals a scaling law for semi-supervised ICL: models achieve optimal performance with more than 1,000 demonstrations. Core contributions: (i) formalizing semi-supervised ICL as a new paradigm, (ii) an iterative pseudo-demonstration selection and refinement method (IterPSD), and (iii) empirical validation across diverse tasks.
📝 Abstract
Most existing work in data selection for In-Context Learning (ICL) has focused on constructing demonstrations from ground-truth annotations, with limited attention given to selecting reliable self-generated annotations. In this work, we propose a three-step semi-supervised ICL framework: annotation generation, demonstration selection, and semi-supervised inference. Our baseline, Naive-SemiICL, which prompts the model with selected high-confidence self-generated demonstrations, outperforms a 16-shot baseline by an average of 9.94% across 16 datasets. We further introduce IterPSD, an annotation approach that refines pseudo-demonstrations iteratively, achieving up to 6.8% additional gains in classification tasks. Lastly, we reveal a scaling law for semi-supervised ICL, where models achieve optimal performance with over 1,000 demonstrations.
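The confidence-guided selection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `predict` is a hypothetical stand-in for an LLM self-labeling call that also returns a label confidence (e.g., derived from token log-probabilities), and the threshold value is an assumption.

```python
def predict(text):
    """Stub for an LLM self-labeling call.

    A real system would query a model and read the label's
    probability from its logprobs; this toy rule is for
    illustration only.
    """
    label = "pos" if "good" in text else "neg"
    confidence = 0.95 if ("good" in text or "bad" in text) else 0.55
    return label, confidence

def select_pseudo_demos(unlabeled, threshold=0.9):
    """Self-label unlabeled inputs and keep only high-confidence
    (input, pseudo-label) pairs as in-context demonstrations."""
    demos = []
    for text in unlabeled:
        label, conf = predict(text)
        if conf >= threshold:
            demos.append((text, label))
    return demos

pool = ["a good movie", "a bad ending", "it was fine"]
demos = select_pseudo_demos(pool)
# The low-confidence third example is filtered out; the surviving
# pairs would be prepended to the prompt at inference time.
```

The selected pairs then serve as the prompt's demonstrations during semi-supervised inference; an iterative variant like IterPSD would feed them back to relabel the remaining pool.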