PCMind-2.1-Kaiyuan-2B Technical Report

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
A significant gap exists between open-source communities and industry in large language model (LLM) pretraining, primarily due to the inaccessibility of high-quality proprietary data and advanced training methodologies. To bridge this gap, we propose a resource-efficient pretraining framework comprising: (1) a quantile-based data benchmarking method for cross-source, quantitative data quality assessment; (2) a multi-phase selective repetition scheme that improves utilization of sparse high-quality samples; and (3) a multi-domain curriculum learning strategy that schedules training samples according to their quality ranking. Combined with architectural modifications for FP16 stability and a highly optimized data preprocessing pipeline, the released Kaiyuan-2B model achieves performance competitive with state-of-the-art fully open-source 2B-parameter models. All model weights, curated datasets, and training code are publicly released under the Apache 2.0 license, enabling reproducible and scalable low-resource pretraining.
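To make the quantile benchmarking idea concrete, the minimal sketch below (an illustration, not the authors' released code) applies a shared per-document quality score to samples from two hypothetical sources and compares the resulting score distributions at fixed quantiles. The source names, quantile grid, and synthetic Beta-distributed scores are all assumptions for demonstration.

```python
# Illustrative sketch of quantile-based data quality benchmarking across
# heterogeneous sources. Names and score distributions are assumptions,
# not the paper's implementation.
import numpy as np

def quantile_profile(scores, quantiles=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarize a source's quality-score distribution at fixed quantiles."""
    return {q: float(np.quantile(scores, q)) for q in quantiles}

def benchmark_sources(source_scores):
    """Compare heterogeneous sources on a shared quality scale.

    source_scores maps a source name to an array of per-document quality
    scores, e.g. produced by one common quality scorer applied to samples
    drawn from every source.
    """
    return {name: quantile_profile(np.asarray(s)) for name, s in source_scores.items()}

# Toy usage with synthetic scores standing in for real corpora.
rng = np.random.default_rng(0)
profiles = benchmark_sources({
    "web_crawl": rng.beta(2, 5, size=10_000),  # many lower-quality documents
    "curated":   rng.beta(5, 2, size=2_000),   # fewer, higher-quality documents
})
for name, profile in profiles.items():
    print(name, {q: round(v, 3) for q, v in profile.items()})
```

Comparing sources at matched quantiles, rather than by mean score alone, makes it easier to see where a noisy source still contains a usable high-quality tail.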

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
Problem

Research questions and friction points this paper is trying to address.

Addresses the knowledge gap between open-source and industry LLMs
Improves training efficiency and effectiveness under resource constraints
Provides scalable solutions for resource-limited pretraining with open assets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets
Strategic Selective Repetition scheme to leverage sparse, high-quality data
Multi-Domain Curriculum Training policy that orders training samples by quality (see the sketch after this list)
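As a rough illustration of how selective repetition and a quality-ordered curriculum could interact, the sketch below sorts a toy corpus by quality score and re-appends the highest-scoring tail so that scarce high-quality samples are revisited late in training. The helper names, quantile threshold, and repeat factor are hypothetical choices for demonstration, not the recipe reported in the paper.

```python
# Illustrative sketch of quality-ordered curriculum scheduling with selective
# repetition of scarce high-quality samples. Thresholds and repeat factors are
# assumptions, not values from the paper.
import numpy as np

def build_curriculum(docs, scores, high_quality_quantile=0.9, repeat_factor=3):
    """Order training samples by ascending quality and upsample the top tail.

    docs:   list of training documents
    scores: per-document quality scores on a shared scale
    Returns a document sequence in which low-quality samples appear once,
    early, while samples above the chosen quantile are repeated later on.
    """
    scores = np.asarray(scores)
    order = np.argsort(scores)                        # low -> high quality
    threshold = np.quantile(scores, high_quality_quantile)
    top = [i for i in order if scores[i] >= threshold]
    schedule = list(order) + top * (repeat_factor - 1)  # re-append high-quality tail
    return [docs[i] for i in schedule]

# Toy usage with synthetic documents and scores.
docs = [f"doc_{i}" for i in range(10)]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.95, 0.4, 0.5, 0.7, 0.6]
print(build_curriculum(docs, scores, high_quality_quantile=0.8, repeat_factor=2))
```

Because only the tail above the chosen quantile is repeated, the extra exposure concentrates on the scarce samples the quality ranking flags, while the rest of the corpus is still seen once in low-to-high order.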
👥 Authors
Kairong Luo (Tsinghua University)
Zhenbo Sun (Tsinghua University)
Xinyu Shi (Tsinghua University)
Shengqi Chen (Tsinghua University)
Bowen Yu (Qwen Team, Alibaba Group)
Yunyi Chen (Tsinghua University)
Chenyi Dang (Tsinghua University)
Hengtao Tao (Peng Cheng Laboratory)
Hui Wang (Peng Cheng Laboratory)
Fangming Liu (Huazhong University of Science & Technology)
Kaifeng Lyu (Tsinghua University)
Wenguang Chen (Tsinghua University; Peng Cheng Laboratory)