🤖 AI Summary
The prevailing “more data is better” paradigm in large language model (LLM) development overlooks task-specific diminishing returns and fails to identify which tasks genuinely benefit from scale. Method: We propose a data topology-driven framework for data expansion decisions that integrates topological data analysis, task sensitivity modeling, and computational efficiency evaluation, making it the first effort to ground expansion decisions in the intrinsic structural properties of the data. Contributions: (1) we challenge the heuristic of indiscriminate data scaling by establishing a task-oriented paradigm for quantifying data value; (2) we provide interpretable theoretical foundations for high-value data acquisition, parameter-efficient training strategies, and heterogeneous compute architecture design; and (3) we advance the shift from purely “data-driven” computation toward a “task–data co-driven” paradigm, enabling principled, resource-aware LLM development.
📝 Abstract
Large Language Models require ever more data to train and scale, but rather than acquiring any data we can find, we should ask which types of tasks are most likely to benefit from data scaling: we should be intentional in our data acquisition. We argue that the topology of the data itself informs which tasks to prioritize when scaling data, and shapes the development of the next generation of compute paradigms for tasks where data scaling is inefficient, or even insufficient.
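
To make the thesis concrete, here is a minimal sketch of one way an intrinsic structural property of task data could drive a data-expansion decision. This is an illustration under our own assumptions, not the paper's actual method: we stand in for topological analysis with the TwoNN intrinsic-dimension estimator (Facco et al., 2017), and the decision rule (higher intrinsic dimension implies more samples are needed to cover the task's data manifold, so the task ranks higher for data acquisition) is a hypothetical heuristic.

```python
# Hypothetical sketch: rank tasks for data expansion by the intrinsic
# dimension of their embedding clouds. The estimator and the ranking rule
# are illustrative assumptions, not the framework described in the paper.
import numpy as np
from scipy.spatial import cKDTree


def twonn_intrinsic_dimension(X: np.ndarray) -> float:
    """TwoNN MLE estimate of intrinsic dimension from the ratio of each
    point's second- to first-nearest-neighbor distance."""
    tree = cKDTree(X)
    # k=3 returns each point itself (distance 0) plus its two nearest neighbors.
    dists, _ = tree.query(X, k=3)
    r1, r2 = dists[:, 1], dists[:, 2]
    mask = r1 > 0  # drop duplicate points to avoid division by zero
    mu = r2[mask] / r1[mask]
    return mask.sum() / np.log(mu).sum()


def rank_tasks_for_data_expansion(
    task_embeddings: dict[str, np.ndarray],
) -> list[tuple[str, float]]:
    """Hypothetical decision rule: tasks whose embeddings occupy a
    higher-dimensional manifold need more samples to cover it, so they
    rank as stronger candidates for data acquisition."""
    scores = {t: twonn_intrinsic_dimension(X) for t, X in task_embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tasks = {
        # a task whose data lies near a 3-dimensional subspace of the embedding space
        "templated_qa": rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 64)),
        # a task whose data fills many embedding directions
        "open_ended_reasoning": rng.normal(size=(2000, 64)),
    }
    for task, dim in rank_tasks_for_data_expansion(tasks):
        print(f"{task}: estimated intrinsic dimension ~ {dim:.1f}")
```

In this toy run, the templated task's embeddings concentrate on a low-dimensional subspace and rank low, while the open-ended task fills many directions and ranks high; a real system would replace both the estimator and the rule with the framework's own topological and task-sensitivity measures.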