Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the distribution truncation effect induced by LLM generation mechanisms (top-p sampling, temperature scaling, and finite sampling) during mixed training on real and synthetic data, and its detrimental impact on long-tail knowledge acquisition. Modeling the phenomenon theoretically and analyzing it empirically, the authors reveal a three-phase scaling law governing LLM performance under mixed-data training. Building on this, they derive the first generalization error upper bound tailored to real-synthetic hybrid settings and use it to design a lightweight data valuation framework that explicitly accounts for distribution truncation. The method substantially alleviates long-tail knowledge loss: it achieves higher data valuation accuracy than state-of-the-art methods across four diverse downstream tasks while reducing computational overhead by an order of magnitude.
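The truncation mechanism the summary refers to is easy to demonstrate. Below is a minimal sketch (not the paper's code) applying nucleus sampling with p = 0.9 to a Zipf-like next-token distribution; the vocabulary size and the rank-1000 tail cutoff are illustrative assumptions. Even after renormalization, roughly a third of the tail's probability mass disappears, which is the long-tail loss the paper analyzes.

```python
# Minimal sketch (not the paper's code): how nucleus (top-p) sampling truncates
# the tail of a next-token distribution. The Zipf-like vocabulary, p = 0.9
# threshold, and "tail = rank >= 1000" cutoff are illustrative assumptions.
import numpy as np

V = 10_000                                   # hypothetical vocabulary size
probs = 1.0 / np.arange(1, V + 1)            # head-heavy Zipf-like distribution
probs /= probs.sum()

def top_p_truncate(p: np.ndarray, top_p: float = 0.9) -> np.ndarray:
    """Keep the smallest prefix of top-ranked tokens whose mass reaches top_p,
    zero out the rest, and renormalize (standard nucleus sampling)."""
    order = np.argsort(p)[::-1]              # token ids, most probable first
    cum = np.cumsum(p[order])
    cut = np.searchsorted(cum, top_p)        # first token crossing the threshold
    keep = np.zeros_like(p, dtype=bool)
    keep[order[: cut + 1]] = True
    trunc = np.where(keep, p, 0.0)
    return trunc / trunc.sum()

trunc = top_p_truncate(probs, top_p=0.9)
tail = np.arange(V) >= 1_000                 # call ranks beyond 1000 the "tail"
print(f"tail mass before truncation: {probs[tail].sum():.3f}")   # ~0.23
print(f"tail mass after truncation:  {trunc[tail].sum():.3f}")   # ~0.15
```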

📝 Abstract
The rapid progress of large language models (LLMs) is fueled by a growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from generation mechanisms such as top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that mark transitions in how the model learns head and tail knowledge. We further derive an LLM generalization bound designed for real-synthetic mixtures, revealing several key factors that govern generalization performance. Building on these theoretical findings, we propose an effective yet efficient data valuation method that scales to large datasets. Comprehensive experiments across four tasks (image classification, sentiment classification, instruction following, and complex reasoning) demonstrate that our method surpasses state-of-the-art baselines in data valuation at significantly lower computational cost.
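The "three-phase scaling behavior characterized by two breakpoints" can be read as a segmented power law: loss versus dataset size is approximately linear in log-log coordinates within each phase, with slope changes at the breakpoints. The sketch below fits such a curve by grid-searching the two breakpoints; the synthetic data, breakpoint locations, and exponents are invented for illustration and are not taken from the paper.

```python
# Hedged sketch: recovering a three-phase scaling curve (two breakpoints) in
# log-log space. The functional form, noise level, and true parameters are
# illustrative assumptions, not the paper's fitted law.
import numpy as np

rng = np.random.default_rng(1)

log_n = np.linspace(6, 16, 200)                  # log dataset size
b1_true, b2_true, slopes = 9.0, 13.0, (-0.30, -0.10, -0.25)

def piecewise(x, b1, b2, s):
    """Continuous three-segment linear function of x with breaks at b1 < b2."""
    y = s[0] * np.minimum(x, b1)
    y += s[1] * np.clip(x - b1, 0, b2 - b1)
    y += s[2] * np.maximum(x - b2, 0)
    return y

log_loss = piecewise(log_n, b1_true, b2_true, slopes) + rng.normal(0, 0.01, log_n.size)

def fit_three_phase(x, y, grid):
    """Grid-search the two breakpoints; fit per-segment slopes by least squares."""
    best_sse, best_fit = np.inf, None
    for i, b1 in enumerate(grid):
        for b2 in grid[i + 1:]:
            X = np.column_stack([np.ones_like(x),         # intercept
                                 np.minimum(x, b1),       # phase-1 hinge
                                 np.clip(x - b1, 0, b2 - b1),
                                 np.maximum(x - b2, 0)])
            coef = np.linalg.lstsq(X, y, rcond=None)[0]
            sse = np.sum((X @ coef - y) ** 2)
            if sse < best_sse:
                best_sse, best_fit = sse, (b1, b2, coef)
    return best_fit

b1, b2, coef = fit_three_phase(log_n, log_loss, np.linspace(7, 15, 33))
print(f"estimated breakpoints: {b1:.2f}, {b2:.2f} (true: {b1_true}, {b2_true})")
print(f"estimated slopes: {coef[1]:.3f}, {coef[2]:.3f}, {coef[3]:.3f}")
```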
Problem

Research questions and friction points this paper is trying to address.

Analyzing the scaling dynamics of LLMs trained on real-synthetic data mixtures
Addressing distributional discrepancies in synthetic data that degrade tail knowledge (see the sketch after this list)
Developing an efficient data valuation method for mixed datasets
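To make "distributional discrepancy" operational, here is a toy sketch that uses unigram frequencies as a stand-in for the knowledge distribution; the corpora and the missing-mass proxy are invented for illustration, not taken from the paper.

```python
# Illustrative sketch: measure how much long-tail mass a synthetic corpus is
# missing relative to a real corpus by comparing empirical token frequencies.
# Corpora, whitespace tokenization, and the proxy itself are toy assumptions.
from collections import Counter

real = "the cat sat on the mat while a rare axolotl watched quietly".split()
synthetic = "the cat sat on the mat and the cat sat on the mat again".split()

def freq(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

p_real, p_syn = freq(real), freq(synthetic)

# Mass of real-corpus tokens the synthetic corpus never produces:
# a crude proxy for truncation-induced tail loss.
missing_mass = sum(p for w, p in p_real.items() if w not in p_syn)
print(f"real-token mass absent from synthetic corpus: {missing_mass:.3f}")
```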
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies a three-phase scaling behavior with two breakpoints
Derives a generalization bound for real-synthetic data mixtures
Proposes an efficient data valuation method for large-scale datasets (see the sketch after this list)
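The paper's actual estimator builds on its generalization bound; as a purely hypothetical illustration of a truncation-aware value score, the sketch below rates a candidate example by the mean surprisal of its tokens under the synthetic generator's empirical distribution, so examples carrying truncated tail tokens score highly. The function names and the scoring rule are assumptions, not the paper's method.

```python
# Hypothetical sketch, not the paper's estimator: a lightweight per-example
# "value" that upweights examples carrying tokens the synthetic generator
# underproduces. Corpora, tokenization, and the rule are illustrative.
import math
from collections import Counter

def token_probs(corpus):
    counts = Counter(w for doc in corpus for w in doc.split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def example_value(doc, p_syn, eps=1e-6):
    """Mean surprisal of the example's tokens under the synthetic
    distribution: tokens the generator truncates away score highly."""
    toks = doc.split()
    return sum(-math.log(p_syn.get(w, eps)) for w in toks) / len(toks)

synthetic_corpus = ["the cat sat on the mat", "the dog sat on the mat"]
candidates = ["the cat sat on the mat", "a rare axolotl watched quietly"]

p_syn = token_probs(synthetic_corpus)
for doc in candidates:
    print(f"{example_value(doc, p_syn):6.2f}  {doc}")
# The tail-heavy second example receives a much higher score.
```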