🤖 AI Summary
A significant gap exists between open-source communities and industry in large language model (LLM) pretraining, primarily due to the inaccessibility of high-quality proprietary data and advanced training methodologies. To bridge this gap, we propose a resource-efficient pretraining framework comprising: (1) a novel quantile-based data benchmarking method for cross-source, quantitative data quality assessment; (2) a multi-stage selective resampling mechanism to improve utilization of sparse high-quality samples; and (3) a multi-domain curriculum learning strategy that dynamically schedules training samples in quality-ranked order. Combined with an FP16-stable architecture, a quantized data pipeline, and efficient preprocessing, our released Kaiyuan-2B model achieves state-of-the-art performance among fully open-source 2B-parameter models. All model weights, curated datasets, and training code are publicly released under the Apache 2.0 license, enabling reproducible and scalable low-resource pretraining.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has created a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights into data-mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
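To make the Quantile Data Benchmarking idea concrete, the sketch below compares heterogeneous data sources by the quantiles of their per-sample quality scores rather than by raw means. This is a minimal illustration of the general technique, not the paper's actual implementation; the source names, score distributions, and quantile levels are all invented for demonstration:

```python
import numpy as np

def quantile_profile(scores, qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarize a source's quality-score distribution by its quantiles,
    making differently sized and differently scaled sources comparable."""
    return {q: float(np.quantile(scores, q)) for q in qs}

# Synthetic quality scores for two hypothetical sources.
rng = np.random.default_rng(0)
sources = {
    "web_crawl": rng.normal(loc=0.4, scale=0.15, size=10_000),
    "curated_books": rng.normal(loc=0.7, scale=0.10, size=2_000),
}

# Quantile profiles expose, e.g., that the top decile of a large noisy
# source may still rival a small curated one, informing mixing ratios.
profiles = {name: quantile_profile(s) for name, s in sources.items()}
for name, prof in profiles.items():
    print(name, {q: round(v, 2) for q, v in prof.items()})
```

Comparing whole quantile profiles instead of single summary statistics is what allows cross-source, quantitative assessment: the tail quantiles indicate how much sparse high-quality data a noisy source actually contains.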