AI Summary
To address the scarcity of high-quality labeled data and the efficiency bottlenecks of conventional active learning (AL), this paper proposes a novel "Select-Generate" dual-track AL paradigm, elevating large language models (LLMs) from auxiliary discriminators to autonomous data producers. Methodologically, it integrates prompt engineering, instruction tuning, synthetic data generation, uncertainty estimation, and multi-round human-in-the-loop evaluation to reconstruct AL pipelines across tasks and modalities. Key contributions include: (1) establishing the first unified taxonomy of LLM-augmented AL, synthesizing insights from 120+ studies; (2) systematically characterizing the transformative roles of LLMs in sample selection, generative annotation, and closed-loop optimization; and (3) identifying six open challenges, alongside proposing a reusable, low-label-cost LLM training methodology and practical implementation guidelines.
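The "Select-Generate" dual-track idea described above can be sketched in a few lines. This is a minimal illustrative mock, not the paper's implementation: `uncertainty`, `llm_generate`, and `llm_annotate` are hypothetical stand-ins for a task model's uncertainty score, an LLM data synthesizer, and an LLM annotator, respectively.

```python
def dual_track_round(unlabeled, uncertainty, llm_generate, llm_annotate, budget):
    """One hypothetical 'Select-Generate' round.

    Track 1 (Select): rank the real unlabeled pool by model uncertainty
    and take the `budget` most uncertain examples.
    Track 2 (Generate): ask an LLM to synthesize `budget` new instances.
    Finally, the LLM provides provisional labels for both batches,
    which a human would verify before training (human-in-the-loop).
    """
    selected = sorted(unlabeled, key=uncertainty, reverse=True)[:budget]
    generated = [llm_generate() for _ in range(budget)]
    return [(x, llm_annotate(x)) for x in selected + generated]

# Toy stand-ins (all hypothetical): fixed uncertainty scores and a
# dummy LLM that always emits the same synthetic example and label.
pool = ["doc1", "doc2", "doc3", "doc4"]
scores = {"doc1": 0.2, "doc2": 0.9, "doc3": 0.5, "doc4": 0.7}
batch = dual_track_round(
    unlabeled=pool,
    uncertainty=scores.__getitem__,
    llm_generate=lambda: "synthetic_doc",
    llm_annotate=lambda x: "provisional_label",
    budget=2,
)
print([x for x, _ in batch])  # ['doc2', 'doc4', 'synthetic_doc', 'synthetic_doc']
```

In a real pipeline the two tracks would share state: generation can be conditioned on the regions where selection found the model most uncertain, closing the loop the summary refers to as closed-loop optimization.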
Abstract
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
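"Selecting the most informative data points" is most commonly instantiated as uncertainty sampling: score each unlabeled example by the entropy of the model's predicted class distribution and label the highest-entropy ones. A minimal sketch, assuming a generic `predict_proba` callable (any model mapping an example to class probabilities):

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_informative(unlabeled, predict_proba, budget):
    """Classic uncertainty sampling: return the `budget` examples
    whose predictive distribution has the highest entropy."""
    return sorted(unlabeled,
                  key=lambda x: entropy(predict_proba(x)),
                  reverse=True)[:budget]

# Toy stand-in model: hypothetical fixed probabilities per example.
probas = {
    "a": [0.9, 0.1],  # confident prediction -> low entropy
    "b": [0.5, 0.5],  # maximally uncertain -> highest entropy
    "c": [0.7, 0.3],
}
picked = select_most_informative(["a", "b", "c"], probas.__getitem__, budget=2)
print(picked)  # ['b', 'c'] — the two most uncertain examples win the label budget
```

The survey's point is that LLMs extend this loop beyond selection: the same budget can be spent on LLM-generated instances or LLM-provided annotations when those are cheaper than human labels.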