🤖 AI Summary
To address the diminishing returns that plague naive synthetic data augmentation, this paper proposes Deliberate Practice (DP), a dynamic synthetic data generation framework inspired by the pedagogical principle of "deliberate practice." Departing from the conventional "generate-then-prune" paradigm, DP directly targets the distribution of high-value samples via three core mechanisms: dynamic difficulty adjustment, information-theoretic sample generation, and theory-guided sample selection. The authors present it as the first work to formally incorporate a human learning principle, namely challenge-based training, into synthetic data generation, and they theoretically prove its positive impact on model scaling laws. Implemented as a lightweight plugin, DP integrates with mainstream diffusion models and large language models. Experiments demonstrate significant efficiency gains: on ImageNet-100, DP reduces the required sample count by 3.4x and training iterations by 6x; on ImageNet-1K, it reduces samples by 8x and iterations by 30%, while consistently outperforming state-of-the-art methods across all benchmarks.
📝 Abstract
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws, and we empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires 6x fewer iterations; on ImageNet-1K, it generates 8x fewer samples with a 30% reduction in iterations, while achieving superior performance compared to prior work.
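The core idea, focusing training on challenging, informative samples rather than pruning a large dataset after the fact, can be illustrated with a small uncertainty-scoring sketch. This is not the paper's implementation (DP approximates direct generation of informative samples; the helper names `predictive_entropy` and `select_informative` and the keep-ratio heuristic here are assumptions for illustration): it only shows the standard idea of ranking candidates by the learner's predictive entropy and keeping the most uncertain ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_entropy(probs):
    """Shannon entropy (in nats) of the learner's class probabilities.
    High entropy = the learner is uncertain = the sample is 'challenging'."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def select_informative(candidates, learner_probs, keep_ratio=0.25):
    """Keep the fraction of candidates the learner is most uncertain about,
    ranked by predictive entropy (highest first)."""
    scores = predictive_entropy(learner_probs)
    k = max(1, int(len(candidates) * keep_ratio))
    idx = np.argsort(scores)[::-1][:k]  # indices of the k highest-entropy samples
    return candidates[idx], scores[idx]

# Toy round: 8 candidate "samples" with mock learner probabilities.
candidates = rng.normal(size=(8, 4))   # stand-in for generated data
logits = rng.normal(size=(8, 3))       # mock learner logits over 3 classes
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
kept, kept_scores = select_informative(candidates, probs, keep_ratio=0.5)
print(kept.shape)  # (4, 4): half the candidates survive
```

In a generate-then-prune pipeline this filter runs after generation; DP's contribution is to avoid paying for the discarded samples in the first place by steering generation toward the high-entropy region directly.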