🤖 AI Summary
A significant gap exists between open-source communities and industry in large language model (LLM) pretraining, primarily due to the inaccessibility of high-quality proprietary data and advanced training methodologies. To bridge this gap, we propose a resource-efficient pretraining framework comprising: (1) a novel quantile-based data benchmarking method for cross-source, quantitative data quality assessment; (2) a multi-stage selective resampling mechanism to improve utilization of sparse high-quality samples; and (3) a multi-domain curriculum learning strategy that dynamically schedules training samples in quality-ranked order. Combined with an FP16-stable architecture, a quantized data pipeline, and efficient preprocessing, our released Kaiyuan-2B model achieves state-of-the-art performance among fully open-source 2B-parameter models. All model weights, curated datasets, and training code are publicly released under the Apache 2.0 license, enabling reproducible and scalable low-resource pretraining.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has created a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights into data-mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
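To make the Quantile Data Benchmarking idea concrete, the sketch below compares heterogeneous data sources by the quantiles of their per-sample quality scores rather than by raw means. This is a minimal illustration of the general technique, not the paper's actual implementation; the source names, score distributions, and quantile levels are all invented for demonstration:

```python
import numpy as np

def quantile_profile(scores, qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarize a source's quality-score distribution by its quantiles,
    making differently sized and differently scaled sources comparable."""
    return {q: float(np.quantile(scores, q)) for q in qs}

# Synthetic quality scores for two hypothetical sources.
rng = np.random.default_rng(0)
sources = {
    "web_crawl": rng.normal(loc=0.4, scale=0.15, size=10_000),
    "curated_books": rng.normal(loc=0.7, scale=0.10, size=2_000),
}

# Quantile profiles expose, e.g., that the top decile of a large noisy
# source may still rival a small curated one, informing mixing ratios.
profiles = {name: quantile_profile(s) for name, s in sources.items()}
for name, prof in profiles.items():
    print(name, {q: round(v, 2) for q, v in prof.items()})
```

Comparing whole quantile profiles instead of single summary statistics is what allows cross-source, quantitative assessment: the tail quantiles indicate how much sparse high-quality data a noisy source actually contains.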