🤖 AI Summary
To address the challenge of efficiently deploying large language models (LLMs) on resource-constrained edge devices, this paper introduces the MiniCPM4 series (0.5B/8B), integrating several novel techniques: (1) InfLLM v2, a trainable sparse attention mechanism; (2) BitCPM, a ternary quantization scheme combined with INT4 compression; (3) chunk-wise reinforcement learning rollouts for load-balanced sequence-level optimization; (4) UltraClean data cleaning and the high-quality UltraChat v2 fine-tuning dataset; (5) ModelTunnel v2, a framework for efficient pre-training strategy search; and (6) CPM.cu, a unified inference engine. The models preserve robust long-context modeling capability while substantially improving inference speed and energy efficiency. On mainstream benchmarks, MiniCPM4-8B outperforms open-source LLMs of comparable parameter count, and it achieves faster long-sequence processing than Qwen3-8B. The framework has been successfully deployed in real-world edge applications, including trustworthy survey generation and tool-augmented reasoning.
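To make the sparse-attention idea behind InfLLM v2 concrete, here is a minimal NumPy sketch of block-sparse attention: keys are grouped into blocks, each block is scored by a pooled summary against the query, and attention is computed only over the top-k blocks. This is an illustration under simplifying assumptions (mean-pooled block summaries, a single query vector, per-query top-k); the actual InfLLM v2 scoring, training, and kernel design are not reproduced here.

```python
import numpy as np

def topk_block_attention(q, K, V, block_size=4, k_blocks=2):
    """Sparse attention sketch: attend only to the k_blocks key/value
    blocks whose mean-pooled summary scores highest against query q."""
    n, d = K.shape
    n_blocks = n // block_size
    # Reshape the (truncated) KV cache into contiguous blocks.
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    block_repr = Kb.mean(axis=1)                 # (n_blocks, d) block summaries
    scores = block_repr @ q                      # relevance score per block
    top = np.argsort(scores)[-k_blocks:]         # indices of top-k blocks
    Ks = Kb[top].reshape(-1, d)                  # gather only selected keys
    Vs = Vb[top].reshape(-1, d)
    att = Ks @ q / np.sqrt(d)                    # scaled dot-product scores
    att = np.exp(att - att.max())
    att /= att.sum()                             # softmax over selected keys only
    return att @ Vs

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8)); V = rng.normal(size=(16, 8))
out = topk_block_attention(rng.normal(size=8), K, V)
```

Because both prefilling and decoding touch only the selected blocks, compute and memory traffic scale with `k_blocks * block_size` rather than the full context length.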
📝 Abstract
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollouts for load-balanced reinforcement learning and BitCPM, a data-efficient ternary LLM. Regarding inference systems, we propose CPM.cu, an inference framework that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Extensive evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with the model context protocol, clearly showcasing its broad usability.
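To illustrate what a ternary LLM like BitCPM involves, the sketch below quantizes a weight tensor to the codebook {-1, 0, +1} with a single floating-point scale, using the absmean scheme known from BitNet b1.58. The exact BitCPM quantization-aware training recipe and its combination with INT4 compression are not specified here, so this is an assumed, simplified illustration of the storage format only.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Map weights to ternary codes {-1, 0, +1} plus one scale factor.
    Uses an absmean scale (as in BitNet b1.58); BitCPM's exact recipe
    may differ, so treat this as an illustrative sketch."""
    scale = float(np.abs(w).mean()) + eps        # per-tensor scaling factor
    q = np.clip(np.round(w / scale), -1, 1)      # nearest ternary code
    return q.astype(np.int8), scale

def ternary_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from codes and scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = ternary_quantize(w)
w_hat = ternary_dequantize(q, s)
```

Each weight then needs under 2 bits of storage instead of 16, and matrix multiplies reduce to additions and subtractions of activations, which is what makes such models attractive for edge deployment.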