🤖 AI Summary
To address the high annotation cost of supervised fine-tuning (SFT) for large language models (LLMs) in domain-specific scenarios, this paper proposes a label-efficient data selection method centered on **task diversity**, a departure from conventional prompt-diversity-focused strategies. The authors introduce an **inverse-confidence-weighted cross-task sampling mechanism** that exploits the pre-trained model's varying confidence across tasks to prioritize labeling of low-confidence, high-information samples. Integrated into the SFT pipeline, the method achieves substantial generalization gains from only a small fraction of the annotated data: experiments show a 4% accuracy improvement over full-data training on the MMLU benchmark while cutting annotation cost by up to 80%, and the approach consistently matches or outperforms state-of-the-art selection methods across multiple datasets and annotation budgets.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation -- a process that is time-consuming, labor-intensive, and expensive. In this paper, we address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task diversity as a fundamental principle for effective data selection. This is markedly different from existing methods based on prompt diversity. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks. We combine these facts to devise a simple yet effective sampling strategy: we select examples across tasks using an inverse confidence weighting strategy. This produces models comparable to or better than those trained with more complex sampling procedures, while being significantly easier to implement and less computationally intensive. Notably, our experimental results demonstrate that this method can achieve better accuracy than training on the complete dataset (a 4% increase in MMLU score). Across various annotation budgets and two instruction finetuning datasets, our algorithm consistently performs at or above the level of the best existing methods, while reducing annotation costs by up to 80%.
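
Below is a minimal sketch of how the inverse-confidence-weighted cross-task sampling described above might look in practice. The function, the task names, and the idea of estimating per-task confidence from a mean softmax probability are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def inverse_confidence_allocation(task_confidences, budget, rng=None):
    """Allocate an annotation budget across tasks in inverse proportion to the
    pre-trained model's confidence on each task (illustrative sketch).

    task_confidences: dict mapping task name -> mean model confidence in (0, 1],
        e.g. the average max softmax probability on a few unlabeled prompts per task.
    budget: total number of examples to select for human annotation.
    Returns a dict mapping task name -> number of examples to label from that task.
    """
    rng = rng or np.random.default_rng(0)
    tasks = list(task_confidences)
    # Lower confidence -> larger weight -> more of the budget goes to that task.
    weights = np.array([1.0 / task_confidences[t] for t in tasks])
    probs = weights / weights.sum()
    counts = rng.multinomial(budget, probs)
    return dict(zip(tasks, counts.tolist()))


# Hypothetical example: three tasks with very different pre-trained confidence.
confidences = {"grammar_qa": 0.92, "legal_reasoning": 0.55, "math_word_problems": 0.40}
print(inverse_confidence_allocation(confidences, budget=1000))
```

Drawing the per-task counts from a multinomial keeps the allocation proportional to inverse confidence in expectation; for a small number of tasks, a deterministic largest-remainder split of the budget would serve the same purpose.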