A Survey on Efficient Large Language Model Training: From Data-centric Perspectives

πŸ“… 2025-10-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Problem: Large language models (LLMs) face significant data-efficiency bottlenecks in post-training, including prohibitively high human annotation costs and diminishing marginal returns from scaling training data.
Method: This paper introduces the first systematic, data-centric taxonomy for LLM post-training, encompassing five technical pathways: data selection, quality enhancement, synthetic data generation, knowledge distillation-based compression, and self-evolving data ecosystems. It proposes a unified data-efficient post-training methodology that synthesizes representative works and identifies core challenges and open problems, designs an extensible data ecosystem architecture for synergistic optimization across techniques, and releases an open, structured literature repository.
Contribution/Results: The framework establishes a new paradigm for improving data utilization and reducing resource barriers in LLM post-training, offering both theoretical foundations and practical guidance for efficient model development.
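
To make the data-selection pathway concrete, below is a minimal, self-contained sketch of the common score-and-select pattern: rank candidate training examples with a quality scorer and keep only a budgeted top fraction. The `quality_score` heuristic, the function names, and the selection budget are illustrative assumptions standing in for learned scorers (e.g., reference-model perplexity or reward models); this is not the paper's specific method.

```python
# Minimal sketch of score-and-select data filtering (illustrative only).
# quality_score is a placeholder for a learned scorer such as reference-model
# log-likelihood or a reward model; names and thresholds are assumptions.

from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    response: str


def quality_score(ex: Example) -> float:
    """Stand-in scorer: rewards longer, lexically diverse responses."""
    tokens = ex.response.split()
    if not tokens:
        return 0.0
    diversity = len(set(tokens)) / len(tokens)    # type-token ratio
    length_bonus = min(len(tokens), 200) / 200.0  # saturating length credit
    return 0.7 * diversity + 0.3 * length_bonus


def select_top_fraction(pool: list[Example], keep: float = 0.3) -> list[Example]:
    """Keep the top `keep` fraction of examples by score (budgeted selection)."""
    ranked = sorted(pool, key=quality_score, reverse=True)
    k = max(1, int(len(ranked) * keep))
    return ranked[:k]


if __name__ == "__main__":
    pool = [
        Example("Explain overfitting.",
                "Overfitting is when a model memorizes training noise instead "
                "of learning patterns that generalize."),
        Example("Explain overfitting.", "bad bad bad bad bad"),
    ]
    for ex in select_top_fraction(pool, keep=0.5):
        print(ex.prompt, "->", ex.response[:60])
```

Across the surveyed selection methods, the scorer is typically the component that varies; the budgeted ranking loop itself stays essentially the same.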

πŸ“ Abstract
Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
Problem

Research questions and friction points this paper is trying to address.

Addressing high costs of manual data annotation in LLM training
Overcoming diminishing returns from scaling training data volumes
Developing systematic methods for data-efficient LLM post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic survey of data-efficient LLM post-training methods
Taxonomy covering data selection and data quality enhancement
Focus on synthetic data generation and self-evolving data ecosystems (see the sketch after this list)
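
The synthetic-data and self-evolving pathways can be pictured as a bootstrapping loop in which accepted model-generated instructions feed back into the seed pool. The snippet below is a self-instruct-style sketch under that assumption; `call_llm`, the prompt template, and the dedup/length gate are hypothetical placeholders, not the survey's or any cited paper's exact procedure.

```python
# Illustrative self-instruct-style synthetic data loop (not the paper's method).
# call_llm is a hypothetical stand-in for any chat-completion client.

import random


def call_llm(prompt: str) -> str:
    """Placeholder generator; swap in a real model client in practice."""
    return f"[model output for: {prompt[:40]}...]"


def generate_synthetic_pairs(seed_tasks: list[str], rounds: int = 2) -> list[dict]:
    pool = list(seed_tasks)
    dataset: list[dict] = []
    for _ in range(rounds):
        # 1) Propose a new instruction conditioned on sampled seed tasks.
        seeds = random.sample(pool, k=min(3, len(pool)))
        new_task = call_llm("Write one new instruction similar to: " + " | ".join(seeds))
        # 2) Simple dedup / quality gate before it enters the pool.
        if new_task in pool or len(new_task) < 10:
            continue
        # 3) Answer the new instruction to form a training pair.
        answer = call_llm(new_task)
        dataset.append({"instruction": new_task, "response": answer})
        pool.append(new_task)  # accepted tasks feed back in: the pool self-evolves
    return dataset


if __name__ == "__main__":
    seeds = ["Summarize a news article.",
             "Translate a sentence to French.",
             "Explain a physics concept to a child."]
    for pair in generate_synthetic_pairs(seeds, rounds=3):
        print(pair["instruction"])
```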
Junyu Luo
State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University
Bohan Wu
State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University
Xiao Luo
University of California, Los Angeles
Zhiping Xiao
Postdoc at University of Washington
CSE, DM, ML
Yiqiao Jin
Georgia Institute of Technology
LLM, Natural Language Processing, Data Mining, Computational Social Science
Rong-Cheng Tu
Nanyang Technological University
Image and Video Retrieval, Cross-modal Retrieval, Deep Learning
Nan Yin
Mohamed bin Zayed University of Artificial Intelligence
Graph Neural Networks, Machine Learning, AI4Science
Yifan Wang
University of International Business and Economics
Jingyang Yuan
Peking University
LLM, AI for Science
Wei Ju
State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University
Ming Zhang
State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University