Demystifying Data Organization for Enhanced LLM Training

📅 2026-05-28

🤖 AI Summary

This work addresses the lack of systematic study of data organization strategies in large language model training, a gap that limits both training efficiency and model performance. It formalizes four data organization guidelines (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity) and proposes two low-overhead, broadly applicable data ordering methods, STR and SAW, that reuse precomputed sample-level scores. The methods combine curriculum learning with diversity-aware scheduling to reorder training data at minimal additional computational cost. Experiments across model scales and data sizes, covering both pre-training and supervised fine-tuning (SFT), indicate that the approach improves the stability and final performance of training.

📝 Abstract

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency relies heavily on effective data curation. While data selection has been widely studied, strategic data organization for enhanced training remains underexplored, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods, termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of these guidelines. They also demonstrate the robustness of the proposed data ordering methods in enhancing the stability and performance of LLM training. GitHub link: https://github.com/microsoft/data-efficacy/
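To make the idea of score-based data ordering concrete, the sketch below reorders sample indices using precomputed per-sample scores. It is a generic illustration only and is not STR or SAW, which the abstract names but does not define. The function name `cyclic_curriculum_order`, its parameters, and the specific recipe (several ascending-score passes with shuffling inside small windows) are all assumptions, loosely motivated by the guideline names Cyclic Scheduling, Curriculum Continuity, and Local Diversity.

```python
import random
from typing import List, Sequence


def cyclic_curriculum_order(
    scores: Sequence[float],
    num_cycles: int = 2,
    window: int = 4,
    seed: int = 0,
) -> List[int]:
    """Return a permutation of sample indices built from precomputed scores.

    Hypothetical recipe (NOT the paper's STR or SAW):
      1. Randomly split the samples into `num_cycles` groups.
      2. Sort each group by ascending score (one curriculum pass per group).
      3. Shuffle inside consecutive windows of size `window` so that
         neighbouring samples are not strictly sorted.
    """
    if num_cycles < 1 or window < 1:
        raise ValueError("num_cycles and window must be >= 1")

    rng = random.Random(seed)
    indices = list(range(len(scores)))
    rng.shuffle(indices)  # random assignment of samples to cycles

    order: List[int] = []
    for cycle in range(num_cycles):
        # Every num_cycles-th shuffled index belongs to this cycle.
        part = sorted(indices[cycle::num_cycles], key=lambda i: scores[i])
        for start in range(0, len(part), window):
            chunk = part[start:start + window]
            rng.shuffle(chunk)  # local mixing within the window
            order.extend(chunk)
    return order


if __name__ == "__main__":
    example_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
    print(cyclic_curriculum_order(example_scores, num_cycles=2, window=2))
```

Because the ordering is computed once from existing scores before training starts, a scheme of this kind adds only a sort and a shuffle to the data pipeline, which matches the low-overhead setting the abstract describes.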

Problem

Research questions and friction points this paper is trying to address.

data organization

LLM training

training efficiency

data curation

data ordering

Innovation

Methods, ideas, or system contributions that make the work stand out.

data organization

LLM training

curriculum learning

data ordering

training efficiency
