🤖 AI Summary
The prevailing “more data is better” paradigm in large language model (LLM) development overlooks task-specific diminishing returns and fails to identify which tasks genuinely benefit from scale. Method: We propose a data topology-driven framework for data expansion decisions that integrates topological data analysis, task sensitivity modeling, and computational efficiency evaluation, making it the first effort to ground expansion decisions in the intrinsic structural properties of the data. Contributions: (1) we challenge the heuristic of indiscriminate data scaling by establishing a task-oriented paradigm for quantifying data value; (2) we provide interpretable theoretical foundations for high-value data acquisition, parameter-efficient training strategies, and heterogeneous compute architecture design; and (3) we advance the shift from purely “data-driven” computation toward a “task–data co-driven” paradigm, enabling principled, resource-aware LLM development.
📝 Abstract
Large Language Models require ever more data to train and scale, but rather than acquiring any data we can find, we should ask which types of tasks are most likely to benefit from data scaling: we should be intentional in our data acquisition. We argue that the topology of the data itself informs which tasks to prioritize when scaling data, and shapes the development of the next generation of compute paradigms for tasks where data scaling is inefficient, or even insufficient.
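
To make the thesis concrete, here is a minimal sketch of one way an intrinsic structural property of task data could drive a data-expansion decision. This is an illustration under our own assumptions, not the paper's actual method: we stand in for topological analysis with the TwoNN intrinsic-dimension estimator (Facco et al., 2017), and the decision rule (higher intrinsic dimension implies more samples are needed to cover the task's data manifold, so the task ranks higher for data acquisition) is a hypothetical heuristic.

```python
# Hypothetical sketch: rank tasks for data expansion by the intrinsic
# dimension of their embedding clouds. The estimator and the ranking rule
# are illustrative assumptions, not the framework described in the paper.
import numpy as np
from scipy.spatial import cKDTree


def twonn_intrinsic_dimension(X: np.ndarray) -> float:
    """TwoNN MLE estimate of intrinsic dimension from the ratio of each
    point's second- to first-nearest-neighbor distance."""
    tree = cKDTree(X)
    # k=3 returns each point itself (distance 0) plus its two nearest neighbors.
    dists, _ = tree.query(X, k=3)
    r1, r2 = dists[:, 1], dists[:, 2]
    mask = r1 > 0  # drop duplicate points to avoid division by zero
    mu = r2[mask] / r1[mask]
    return mask.sum() / np.log(mu).sum()


def rank_tasks_for_data_expansion(
    task_embeddings: dict[str, np.ndarray],
) -> list[tuple[str, float]]:
    """Hypothetical decision rule: tasks whose embeddings occupy a
    higher-dimensional manifold need more samples to cover it, so they
    rank as stronger candidates for data acquisition."""
    scores = {t: twonn_intrinsic_dimension(X) for t, X in task_embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tasks = {
        # a task whose data lies near a 3-dimensional subspace of the embedding space
        "templated_qa": rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 64)),
        # a task whose data fills many embedding directions
        "open_ended_reasoning": rng.normal(size=(2000, 64)),
    }
    for task, dim in rank_tasks_for_data_expansion(tasks):
        print(f"{task}: estimated intrinsic dimension ~ {dim:.1f}")
```

In this toy run, the templated task's embeddings concentrate on a low-dimensional subspace and rank low, while the open-ended task fills many directions and ranks high; a real system would replace both the estimator and the rule with the framework's own topological and task-sensitivity measures.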