YuLan-Mini: An Open Data-efficient Language Model

📅 2024-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and low data efficiency of large language model (LLM) training, this work introduces YuLan-Mini, a lightweight, capable base model with 2.42 billion parameters. Methodologically, it proposes three key innovations: (1) a fine-grained data pipeline that combines data cleaning with dynamic data scheduling; (2) a robust optimization method that improves training stability; and (3) a curriculum-inspired annealing strategy that integrates targeted data selection with long-context pre-training. Trained on only 1.08 trillion tokens, YuLan-Mini achieves top-tier performance among models of comparable scale and matches leading industry models on several benchmarks. Crucially, the full data composition for each training phase is publicly released, enabling reproduction. The work establishes a systematic, data-efficient methodology for LLM training and provides an open reference implementation, advancing both rigor and transparency in foundation model development.

📝 Abstract
Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.
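The annealing approach described above implies a multi-phase training schedule: a stable main phase followed by a decay phase in which the learning rate is annealed while targeted data and long-context samples are introduced. The sketch below illustrates one common way such a warmup–stable–anneal learning-rate schedule can be structured; the function name, phase fractions, and rates are illustrative assumptions, not the paper's actual hyperparameters.

```python
import math

def lr_schedule(step, total_steps, peak_lr=1e-3, min_lr=1e-5,
                warmup_frac=0.01, anneal_frac=0.1):
    """Hypothetical warmup-stable-anneal schedule; hyperparameters
    are illustrative, not taken from the YuLan-Mini report."""
    warmup_steps = int(total_steps * warmup_frac)
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < warmup_steps:
        # Linear warmup from 0 to the peak rate.
        return peak_lr * step / max(1, warmup_steps)
    if step < anneal_start:
        # Stable phase: hold the peak rate on the main data mixture.
        return peak_lr
    # Annealing phase: cosine decay to the minimum rate; this is the
    # phase where targeted data selection and long-context samples
    # would be scheduled in.
    progress = (step - anneal_start) / max(1, total_steps - anneal_start)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

A data scheduler would typically key off the same phase boundaries, switching the sampling mixture when the annealing phase begins.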
Problem

Research questions and friction points this paper is trying to address.

Large language model pre-training
High resource demands
Technical complexity of the training process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-efficient pre-training (1.08T tokens)
Training stabilization via robust optimization
Targeted data selection during annealing