UniGeM: Unifying Data Mixing and Selection via Geometric Exploration and Mining

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of large language model scaling: low-quality data. Existing approaches treat data mixing and sample selection in isolation, which can disrupt the structural consistency of code corpora. The paper formulates data curation as a manifold approximation problem and introduces a hierarchical strategy combining macro-level exploration with micro-level mining: stability-based clustering determines optimal mixture weights, while geometric distribution-guided filtering selects high-quality samples. This unified framework jointly optimizes data mixing and selection without relying on proxy models or external reference datasets. Evaluated on 100B tokens, the method trains 8B and 16B sparse Mixture-of-Experts (MoE) models that achieve twice the data efficiency of a random baseline and outperform state-of-the-art methods on reasoning and multilingual benchmarks.
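The two-stage idea described above can be illustrated with a toy sketch. Note that this is a hypothetical reconstruction, not the authors' code: the clustering algorithm, the stability measure (pairwise agreement across seeds), and the distance-band filter are all illustrative stand-ins for the paper's actual procedures.

```python
# Hypothetical sketch of the two-stage pipeline (NOT the paper's implementation):
# 1) Macro-Exploration: cluster an embedding sample under several random seeds;
#    agreement across runs gauges cluster stability, and mixture weights are
#    taken proportional to cluster sizes.
# 2) Micro-Mining: within each cluster, keep points whose distance to the
#    centroid stays within a band, discarding geometric outliers.
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iters=20):
    """Plain k-means on small point lists; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: dist2(p, centroids[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign, centroids

def pair_agreement(a, b):
    """Fraction of point pairs on which two clusterings agree
    (both co-clustered or both separated) -- a simple stability proxy."""
    n, agree, total = len(a), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            agree += (a[i] == a[j]) == (b[i] == b[j])
    return agree / total

# Toy "corpus": two tight clusters plus scattered noise points.
rng = random.Random(0)
points = ([(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(40)]
          + [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(40)]
          + [(rng.uniform(-3, 8), rng.uniform(-3, 8)) for _ in range(10)])

# Macro-Exploration: stability of the clustering across seeds.
runs = [kmeans(points, k=2, seed=s)[0] for s in range(4)]
stability = min(pair_agreement(runs[0], r) for r in runs[1:])

# Mixture weights from cluster sizes of one run.
assign, centroids = kmeans(points, k=2, seed=0)
weights = [assign.count(c) / len(points) for c in range(2)]

# Micro-Mining: keep points within 2x each cluster's mean centroid distance.
kept = []
for c in range(2):
    ds = [(i, math.sqrt(dist2(points[i], centroids[c])))
          for i in range(len(points)) if assign[i] == c]
    mean_d = sum(d for _, d in ds) / len(ds)
    kept += [i for i, d in ds if d <= 2 * mean_d]

print(f"stability={stability:.2f}, weights={weights}, "
      f"kept {len(kept)}/{len(points)}")
```

On this toy data the clustering is highly stable across seeds and most noise points are filtered out, mirroring (at cartoon scale) how stable structure would set mixture weights while geometric filtering prunes low-quality samples.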

📝 Abstract
The scaling of Large Language Models (LLMs) is increasingly limited by data quality. Most methods handle data mixing and sample selection separately, which can break the structure in code corpora. We introduce UniGeM, a framework that unifies mixing and selection by treating data curation as a manifold approximation problem without training proxy models or relying on external reference datasets. UniGeM operates hierarchically: Macro-Exploration learns mixing weights with stability-based clustering; Micro-Mining filters high-quality instances by their geometric distribution to ensure logical consistency. Validated by training 8B and 16B MoE models on 100B tokens, UniGeM achieves 2.0× data efficiency over a random baseline and further improves overall performance compared to SOTA methods in reasoning-heavy evaluations and multilingual generalization.
Problem

Research questions and friction points this paper is trying to address.

data quality
data mixing
sample selection
code corpora
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

manifold approximation
data curation
geometric exploration
hierarchical mining
large language models