AI Summary
Large language models (LLMs) face significant data efficiency bottlenecks in post-training, including prohibitively high human annotation costs and diminishing marginal returns from scaling training data. Method: This paper introduces the first systematic, data-centric taxonomy for LLM post-training, encompassing five technical pathways: data selection, quality enhancement, synthetic data generation, knowledge distillation-based compression, and self-evolving data ecosystems. It proposes a unified "data-efficient post-training" methodology, synthesizing representative works while identifying core challenges and open problems; designs an extensible data ecosystem architecture to enable synergistic optimization across techniques; and releases an open, structured literature repository. Contribution/Results: The framework establishes a novel paradigm for improving data utilization and reducing resource barriers in LLM post-training, offering both theoretical foundations and practical guidance for efficient model development.
Abstract
Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high cost of manual annotation and diminishing marginal returns as data scales grow. Achieving data-efficient post-training has therefore become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category, examine the core challenges of data-efficient LLM post-training, and highlight open problems and promising research directions. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM