DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

📅 2026-03-27
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing approaches to data selection, mixture optimization, and reweighting are fragmented across isolated codebases with inconsistent interfaces, hindering reproducibility and fair comparison. This work proposes DataFlex, a unified dynamic data training framework that integrates sample selection, domain mixing, and sample reweighting within a single modular architecture. An extensible trainer abstraction makes the framework a drop-in replacement for standard large language model training pipelines and keeps it compatible with distributed optimization techniques such as DeepSpeed ZeRO-3. Built upon LLaMA-Factory, it unifies model-dependent operations including embedding extraction, inference, and gradient computation. Experiments show that dynamic data selection outperforms static full-data training on MMLU with both Mistral-7B and Llama-3.2-3B, while DoReMi and ODM improve MMLU accuracy, lower corpus perplexity, and reduce training time when pretraining Qwen2.5-1.5B on SlimPajama.
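The trainer abstraction at the heart of this design is easiest to picture as a pluggable policy hook inside an otherwise standard training loop. The sketch below is a minimal illustration of that idea under assumed names; `DynamicTrainer`, `Selector`, and `select` are hypothetical and do not reflect the actual DataFlex API.

```python
# A minimal sketch of a dynamic-data trainer abstraction.
# All names here (DynamicTrainer, Selector, select) are hypothetical
# illustrations, NOT the actual DataFlex API.
from dataclasses import dataclass
from typing import Protocol, Sequence


class Selector(Protocol):
    """Pluggable policy: scores the pool and picks what to train on next."""

    def select(self, model, pool: Sequence, step: int) -> list[int]:
        ...


@dataclass
class DynamicTrainer:
    model: object          # any HF-style model, e.g. wrapped by DeepSpeed ZeRO-3
    selector: Selector     # selection, mixing, or reweighting policy
    pool: Sequence         # full candidate dataset
    select_every: int = 100
    batch_size: int = 8

    def train(self, total_steps: int) -> None:
        active = list(range(len(self.pool)))
        for step in range(total_steps):
            if step % self.select_every == 0:
                # Model-dependent signals (embeddings, per-sample loss,
                # gradients) are computed inside the selector.
                active = self.selector.select(self.model, self.pool, step)
            # Toy batching: a real trainer would shuffle and iterate.
            batch = [self.pool[i] for i in active[: self.batch_size]]
            self.training_step(batch)

    def training_step(self, batch) -> None:
        """Standard forward/backward/optimizer step, unchanged from the base trainer."""
```

Under this framing, sample selection, domain mixing, and sample reweighting differ only in how the policy scores the candidate pool, which is what makes a drop-in replacement for the standard trainer plausible.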
📝 Abstract
Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
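To make the domain-mixture results concrete: DoReMi-style methods periodically reweight sampling proportions across corpus domains using per-domain excess loss. The following is a minimal sketch of such a multiplicative-weights update; the variable names and hyperparameters (`lr`, `smoothing`) are illustrative assumptions, not DataFlex's actual implementation.

```python
import numpy as np


def update_domain_weights(weights, excess_loss, lr=1.0, smoothing=1e-3):
    """One DoReMi-style multiplicative-weights step.

    weights:     current mixture over domains, sums to 1
    excess_loss: per-domain (proxy - reference) loss, clipped at 0
    """
    logits = np.log(weights) + lr * np.maximum(excess_loss, 0.0)
    new = np.exp(logits - logits.max())  # numerically stable softmax
    new /= new.sum()
    # Mix with the uniform distribution so no domain's weight collapses to zero.
    uniform = np.full_like(new, 1.0 / len(new))
    return (1.0 - smoothing) * new + smoothing * uniform


# Example: three domains; domain 1 has the largest excess loss,
# so its sampling proportion increases after the update.
w = np.array([0.5, 0.3, 0.2])
print(update_domain_weights(w, np.array([0.1, 0.8, 0.0])))
```

In a pretraining run, domains whose proxy loss most exceeds the reference loss receive a larger share of subsequent batches, while the uniform smoothing term keeps every domain represented.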
Problem

Research questions and friction points this paper is trying to address.

Tags: data-centric training, dynamic data optimization, LLM training, reproducibility, data mixture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tags: data-centric training, dynamic data optimization, unified framework, large language models, modular training
👥 Authors

Hao Liang (Peking University): Data Centric Machine Learning, Large Language Models, Multimodal Large Language Models
Zhengyang Zhao (Peking University)
Meiyi Qiang (Peking University)
Mingrui Chen (Institute of Automation, Chinese Academy of Sciences): Computer Vision, Foundation Models
Lu Ma (Institute for Advanced Algorithms Research, Shanghai)
Rongyi Yu (OriginHub Technology)
Hengyi Feng (LLaMA-Factory Team)
Shixuan Sun (Zhongguancun Academy)
Zimo Meng (OpenDataLab)
Xiaochen Ma (Shanghai Artificial Intelligence Laboratory)
Xuanlin Yang (Shanghai Jiao Tong University)
Qifeng Cai (Peking University)
Ruichuan An (Xi'an Jiaotong University / Peking University): VLM, Data Centric AI
Bohan Zeng (PhD student, Peking University): Data-Centric AI, Computer Vision, Diffusion Model, 3D
Zhen Hao Wong (LLaMA-Factory Team)
Chengyu Shen (Zhongguancun Academy)
Runming He (OpenDataLab)
Zhaoyang Han (Nanjing Forestry University)
Yaowei Zheng (Ph.D. student, Beihang University): Machine Learning, Natural Language Processing
Fangcheng Fu (Shanghai Jiao Tong University): machine learning, deep learning, MLSys, distributed computation
Conghui He (Shanghai AI Laboratory): Data-centric AI, LLM, Document Intelligence
Bin Cui (OriginHub Technology)
Zhiyu Li (Tianjin University): Robust control, attitude control
Weinan E (Professor of Mathematics, Princeton University): applied mathematics
Wentao Zhang (Institute of Physics, Chinese Academy of Sciences): photoemission, superconductivity, cuprate, HTSC, time-resolved