AI Summary
Large language models (LLMs) face significant data efficiency bottlenecks in post-training, including prohibitively high human annotation costs and diminishing marginal returns from scaling training data. Method: This paper introduces the first systematic, data-centric taxonomy for LLM post-training, encompassing five technical pathways: data selection, quality enhancement, synthetic data generation, knowledge distillation-based compression, and self-evolving data ecosystems. It proposes a unified "data-efficient post-training" methodology, synthesizing representative works while identifying core challenges and open problems; designs an extensible data ecosystem architecture to enable synergistic optimization across techniques; and releases an open, structured literature repository. Contribution/Results: The framework establishes a novel paradigm for improving data utilization and reducing resource barriers in LLM post-training, offering both theoretical foundations and practical guidance for efficient model development.
Abstract
Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high cost of manual annotation and diminishing marginal returns as data scales grow. Achieving data-efficient post-training has therefore become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category, examine the core challenges of data-efficient LLM post-training, and highlight open problems and promising research directions. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM