Why Less is More (Sometimes): A Theory of Data Curation

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the “Less Is More” (LIMO) paradox: under what conditions does a carefully curated small dataset outperform the full dataset? The question challenges the conventional “more data is better” paradigm. Method: The authors establish a theoretical framework characterizing the phase-transition curve between data quantity and quality, rigorously identifying the critical conditions under which data curation improves generalization. They propose a data selection model based on an imperfect oracle and derive error scaling laws for both label-agnostic and label-aware selection strategies. Results: Empirical validation on ImageNet demonstrates that curated subsets significantly improve classification accuracy and mitigate model collapse. This work provides the first unified theoretical explanation of the LIMO phenomenon and delivers computationally tractable optimization principles for data curation in large-model training.

📝 Abstract
This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting “more is more” (Sun et al., 2025) are challenged by methods like LIMO (“less is more”) and s1 (Ye et al., 2025; Muennighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies where an imperfect oracle selects the training examples according to their difficulty and correctness. Our results provide exact scaling law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions, small curated datasets can outperform full datasets, and we provide analytical conditions for this by deriving precise phase transition curves tied to data size and quality. We validate these theoretical claims with empirical results on ImageNet, confirming our predictions about when curation improves accuracy and can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the contradictory curation strategies recently observed in LLM mathematical reasoning.
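To make the abstract's setup concrete, here is a minimal toy sketch of curation by an imperfect oracle. All names, thresholds, and the noise model are illustrative assumptions, not the paper's actual formalism: each example has a difficulty and a label that is clean with some probability; a label-agnostic rule keeps only hard examples, while a label-aware rule additionally drops examples the oracle (correct with probability `oracle_acc`) flags as mislabeled.

```python
import random

random.seed(0)

def make_dataset(n, label_noise=0.2):
    """Toy dataset: (difficulty in [0,1], whether the label is clean)."""
    return [(random.random(), random.random() > label_noise) for _ in range(n)]

def curate(data, oracle_acc=0.9, keep_hard=0.5, label_aware=True):
    """Keep the hardest `keep_hard` fraction as judged from a noisy
    difficulty estimate; if label_aware, also drop examples the
    imperfect oracle flags as mislabeled."""
    kept = []
    for difficulty, label_ok in data:
        noisy_difficulty = difficulty + random.gauss(0, 0.05)
        if noisy_difficulty < 1 - keep_hard:  # label-agnostic difficulty filter
            continue
        if label_aware:
            # Oracle's verdict agrees with the truth w.p. oracle_acc
            verdict = label_ok if random.random() < oracle_acc else not label_ok
            if not verdict:
                continue
        kept.append((difficulty, label_ok))
    return kept

data = make_dataset(10_000)
subset = curate(data, label_aware=True)
frac_clean_full = sum(ok for _, ok in data) / len(data)
frac_clean_subset = sum(ok for _, ok in subset) / len(subset)
print(len(subset), round(frac_clean_full, 2), round(frac_clean_subset, 2))
```

The curated subset is much smaller than the full dataset but has a substantially higher fraction of clean labels; the paper's phase-transition analysis characterizes when this quality gain outweighs the quantity loss.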
Problem

Research questions and friction points this paper is trying to address.

Resolving the paradox of when using less data improves machine learning performance
Analyzing how curated data subsets outperform full datasets under specific conditions
Explaining contradictory data curation strategies in mathematical reasoning for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Theoretical framework for data curation strategies
Scaling laws for label-agnostic and label-aware curation
Analytical conditions for curated dataset superiority