Entropy-Based Data Selection for Language Models

📅 2026-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two challenges in fine-tuning language models under computational constraints: data redundancy and the difficulty of assessing data utility. The authors propose an Entropy-Based Unsupervised Data Selection (EUDS) framework that uses information entropy to quantify sample uncertainty, together with a lightweight filtering mechanism that automatically identifies high-value training instances without requiring additional annotations. By coupling entropy-driven data selection with fine-tuning efficiency optimization, a combination the authors present as novel in this domain, the method substantially reduces both the required training data volume and the computational overhead across multiple downstream tasks (sentiment analysis, topic classification, and question answering) while maintaining or even improving model performance.
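The entropy-based scoring described above can be sketched in a few lines. This is a minimal illustration under assumptions the summary does not spell out: each sample is scored by the Shannon entropy of a model's predicted class distribution, and the most uncertain fraction is kept for fine-tuning. The function names, the `keep_ratio` parameter, and the descending-entropy ranking rule are hypothetical, not taken from the paper.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of class probabilities."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_high_entropy(probs: np.ndarray, keep_ratio: float = 0.3) -> np.ndarray:
    """Return indices of the most uncertain samples (highest entropy first)."""
    h = predictive_entropy(probs)
    k = max(1, int(len(h) * keep_ratio))
    return np.argsort(h)[::-1][:k]

# Toy example: 4 samples, 3 classes.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction -> low entropy
    [0.34, 0.33, 0.33],  # near-uniform -> high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
])
selected = select_high_entropy(probs, keep_ratio=0.5)  # keeps the 2 most uncertain
```

Because the score depends only on the model's output distribution, the filter needs no labels, which is what makes the selection unsupervised.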

📝 Abstract
Modern language models (LMs) increasingly demand two critical resources: compute and data. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs, but their effectiveness is closely tied to the compute budget, which is often high. Motivated by the resource limitations of practical fine-tuning scenarios, we systematically examine the relationship between data selection and uncertainty estimation of the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, offering new ways to alleviate data scarcity, evaluating data utility remains challenging, which makes efficient data selection indispensable. To address these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework, which establishes a computationally efficient data-filtering mechanism. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks, together with theoretical analysis, validate its effectiveness: EUDS significantly reduces computational cost and training time while requiring less data, providing a practical solution for efficient fine-tuning of LMs in compute-constrained scenarios.
Problem

Research questions and friction points this paper addresses.

data selection
language models
computational efficiency
fine-tuning
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy-based selection
unsupervised data selection
language model fine-tuning
computational efficiency
data filtering