From Selection to Generation: A Survey of LLM-based Active Learning

📅 2025-02-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scarcity of high-quality labeled data and the efficiency bottlenecks of conventional active learning (AL), this paper proposes a novel “Select→Generate” dual-track AL paradigm, elevating large language models (LLMs) from auxiliary discriminators to autonomous data producers. Methodologically, it integrates prompt engineering, instruction tuning, synthetic data generation, uncertainty estimation, and multi-round human-in-the-loop evaluation to reconstruct AL pipelines across tasks and modalities. Key contributions include: (1) establishing the first unified taxonomy of LLM-augmented AL, synthesizing insights from 120+ studies; (2) systematically characterizing the transformative roles of LLMs in sample selection, generative annotation, and closed-loop optimization; and (3) identifying six open challenges, alongside proposing a reusable, low-label-cost LLM training methodology and practical implementation guidelines.

๐Ÿ“ Abstract
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
Problem

Research questions and friction points this paper is trying to address.

Enhance model efficiency with LLM-based Active Learning.
Generate new data instances using Large Language Models.
Survey LLM impacts on Active Learning across domains.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs enhance data selection
LLMs generate new data instances
LLMs provide cost-effective annotations
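The Select→Generate loop summarized above can be sketched minimally: an uncertainty-based selector picks the most ambiguous unlabeled items, and an LLM acts as the annotator. This is a toy illustration, not an implementation from the surveyed works; `predict`, `mock_llm`, and `llm_annotate` are hypothetical stand-ins for a task model, an LLM API call, and an annotation wrapper.

```python
import math

def entropy(probs):
    # Shannon entropy of a predicted class distribution (natural log);
    # higher entropy = more model uncertainty.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool, predict, k):
    # "Select" track: rank unlabeled items by predictive entropy, take top-k.
    return sorted(pool, key=lambda x: entropy(predict(x)), reverse=True)[:k]

def llm_annotate(items, llm):
    # "Generate" track: the LLM supplies cost-effective labels
    # (a real system would call an LLM API here).
    return [(x, llm(x)) for x in items]

# Toy demo with a mock classifier and a mock LLM annotator.
pool = ["great movie", "terrible plot", "it was fine"]
predict = lambda x: ((0.5, 0.5) if "fine" in x
                     else (0.9, 0.1) if "great" in x
                     else (0.1, 0.9))
mock_llm = lambda x: "negative" if "terrible" in x else "positive"

queried = select_most_uncertain(pool, predict, k=1)
labeled = llm_annotate(queried, mock_llm)
print(labeled)  # the most ambiguous item, now with an LLM-provided label
```

In a full pipeline the newly labeled pairs would be fed back into model training and the loop repeated, which is the closed-loop optimization the summary refers to.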
Authors

Yu Xia, University of California San Diego
Subhojyoti Mukherjee, Adobe Research (Multi-armed Bandits, Reinforcement Learning, Large Language Models, RLHF)
Zhouhang Xie, University of California, San Diego (natural language processing, machine learning, recommender systems)
Junda Wu, University of California San Diego (Natural Language Processing, Recommender System, Multimodal Learning, Reinforcement Learning)
Xintong Li, University of California San Diego
Ryan Aponte, Carnegie Mellon University
Hanjia Lyu, University of Rochester (AI and Society, Multimodal LLMs, Graph Learning, Computational Social Science, Health Informatics)
Joe Barrow, Pattern Data (Natural Language Processing)
Hongjie Chen, Dolby Labs
Franck Dernoncourt, NLP/ML Researcher, MIT PhD (Machine Learning, Neural Networks, Natural Language Processing)
Branislav Kveton, Adobe Research (Artificial Intelligence, Machine Learning)
Tong Yu, Adobe Research
Ruiyi Zhang, Adobe Research
Jiuxiang Gu, Adobe Research (Computer Vision, Natural Language Processing, Machine Learning)
Nesreen K. Ahmed, Senior Principal Scientist, Cisco AI Research, Intel Labs, Purdue University (Geometric Deep Learning, Graph Representation Learning, ML for Systems, ML4code)
Yu Wang, University of Oregon
Xiang Chen, Adobe Research
Hanieh Deilamsalehy, Adobe Research
Sungchul Kim, Adobe (Data mining, Machine learning, Bioinformatics)
Zhengmian Hu, Adobe Research (Deep Learning, Monte Carlo)
Yue Zhao, University of Southern California
Nedim Lipka, Adobe Systems Inc (Big Data Analytics, Machine Learning, Web Mining, Online Advertisement)
Seunghyun Yoon, Assistant Professor, Korea Institute of Energy Technology (KENTECH) (Reinforcement Learning, Deep Learning, Data Science, Networking, Cyber Security)
Ting-Hao Kenneth Huang, Pennsylvania State University
Zichao Wang, Adobe Research (document AI, AI for education, natural language processing, machine learning)
Puneet Mathur, Adobe Research
Soumyabrata Pal, Adobe Research, India (LLM Efficiency, Machine Learning Theory, Applied Statistics)
Koyel Mukherjee, Adobe Research (Algorithms, Deep Learning, Optimization, Online learning)
Zhehao Zhang, The Ohio State University (Natural Language Processing)
Namyong Park, Meta AI (Machine Learning, Representation Learning, Graph Learning, Knowledge Reasoning, Complex Networks)
Thien Huu Nguyen, University of Oregon (Information Extraction, Deep Learning, Natural Language Processing, Machine Learning)
Jiebo Luo, University of Rochester
Ryan A. Rossi, Adobe Research (Machine Learning, Personalization, Graph Representation Learning, Graph ML, Graph Theory)
Julian McAuley, Professor, UC San Diego (Recommender Systems, Natural Language Processing, Personalization, Computer Music)