MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

📅 2026-05-28

🤖 AI Summary

This work addresses efficient, semantics-aware data selection from heterogeneous sources during the mid-training of large language models. The authors propose MIRA, a source-aware filtering framework that discovers evaluation rubrics for each source group through a self-anchoring mechanism rather than assuming a fixed rubric. Those rubric-based judgments are then distilled into lightweight student scorers that can filter the entire corpus. On a code-oriented mid-training task with 21 sources in 5 source groups, the approach matches the full-corpus run while using only half the tokens and outperforms selection baselines across nine code benchmarks.

📝 Abstract

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
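
The final filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Doc` fields, the `select_by_group` helper, the toy scorers, and the choice to apply the token budget separately within each source group are all assumptions made for illustration. The abstract states only that each source group gets its own distilled scorer and that the selected data amounts to half the tokens.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    text: str
    source_group: str  # hypothetical group label, e.g. "repo_code"
    num_tokens: int


# A scorer maps a document to a quality score (higher is better).
# In MIRA this role is played by a distilled student scorer; here it
# is any callable, which is an assumption of this sketch.
Scorer = Callable[[Doc], float]


def select_by_group(
    docs: List[Doc],
    scorers: Dict[str, Scorer],
    keep_fraction: float = 0.5,
) -> List[Doc]:
    """Keep the highest-scoring docs in each source group, up to
    keep_fraction of that group's tokens (assumed per-group budget)."""
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_group[doc.source_group].append(doc)

    selected: List[Doc] = []
    for group, group_docs in by_group.items():
        scorer = scorers[group]  # one scorer per source group
        budget = keep_fraction * sum(d.num_tokens for d in group_docs)
        used = 0
        # Greedy pass from best to worst score; docs that would
        # exceed the budget are skipped, smaller ones may still fit.
        for doc in sorted(group_docs, key=scorer, reverse=True):
            if used + doc.num_tokens > budget:
                continue
            selected.append(doc)
            used += doc.num_tokens
    return selected
```

Scoring within a group rather than across the whole corpus keeps scores from different scorers from being compared directly, which matters when each group is judged against its own rubric.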

Problem

Research questions and friction points this paper is trying to address.

mid-training

data selection

source-aware

rubric

heterogeneous sources

Innovation

Methods, ideas, or system contributions that make the work stand out.

mid-training

source-aware

rubric anchoring