🤖 AI Summary
Existing diversity metrics are model-agnostic and thus fail to capture what actually drives generalization. This work addresses that gap with two contributions: (1) G-Vendi, a diversity metric based on the entropy of model-induced gradients, which treats gradient-space structure as the core representation of diversity and correlates strongly with out-of-distribution (OOD) generalization (Spearman's ρ ≈ 0.9); and (2) Prismatic Synthesis, a data-synthesis framework that targets underrepresented regions of gradient space, enabling efficient generation of high-value synthetic data with a small proxy model. Experiments show that PrismMath-7B outperforms R1-Distill-Qwen-7B on 6 of 7 challenging benchmarks, and that scaling training data with Prismatic Synthesis consistently improves both in-distribution (ID) and OOD performance, surpassing state-of-the-art models whose data generators are 20× larger.
📝 Abstract
Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model to compute gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data -- not just on in-distribution tests but across unseen, out-of-distribution benchmarks -- significantly outperforming state-of-the-art models that rely on a data generator 20 times larger than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B -- the same base model trained on proprietary data generated by the 671B R1 -- on 6 out of 7 challenging benchmarks.
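To make the "entropy of model-induced gradients" idea concrete, here is a minimal sketch of a Vendi-score-style diversity metric over per-example gradient vectors: the exponential of the Shannon entropy of the eigenvalues of the normalized gradient similarity matrix. The function name and the use of cosine similarity are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gradient_vendi(grads: np.ndarray) -> float:
    """Sketch of an entropy-based diversity score for gradient vectors.

    grads: array of shape (n_examples, n_params), one (proxy-model)
    gradient per training example. Returns the exponential of the
    entropy of the eigenvalue spectrum of the similarity matrix --
    an "effective number" of distinct gradient directions.
    """
    # L2-normalize each gradient so similarity becomes cosine similarity.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    g = grads / np.clip(norms, 1e-12, None)
    n = g.shape[0]
    # Eigenvalues of K/n (with K = g g^T) are non-negative and sum to 1,
    # so they can be read as a probability distribution.
    eigvals = np.linalg.eigvalsh(g @ g.T / n)
    eigvals = np.clip(eigvals, 0.0, None)
    eigvals = eigvals / eigvals.sum()
    # Shannon entropy of the spectrum, skipping zero eigenvalues.
    logs = np.log(eigvals, where=eigvals > 0, out=np.zeros_like(eigvals))
    entropy = -np.sum(eigvals * logs)
    return float(np.exp(entropy))
```

Under this sketch, n identical gradients score 1 (one effective direction) and n mutually orthogonal gradients score n; a synthesis loop in the spirit of Prismatic Synthesis would then preferentially keep generated examples whose gradients fall in sparsely populated directions of this space.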