🤖 AI Summary
To address the challenge of efficiently deploying large language models (LLMs) on resource-constrained edge devices, this paper introduces the MiniCPM4 series (0.5B/8B), integrating several novel techniques: (1) InfLLM v2, a trainable sparse attention mechanism; (2) BitCPM, a ternary quantization scheme combined with INT4 compression; (3) chunk-wise reinforcement learning rollouts for load-balanced sequence-level optimization; (4) UltraClean data cleaning and the high-quality UltraChat v2 fine-tuning dataset; (5) ModelTunnel v2, a framework for efficient pre-training strategy search; and (6) CPM.cu, a unified inference engine. The models preserve robust long-context modeling capability while substantially improving inference speed and energy efficiency. On mainstream benchmarks, MiniCPM4-8B outperforms open-source LLMs of comparable parameter count, and it achieves faster long-sequence processing than Qwen3-8B. The framework has been successfully deployed in real-world edge applications, including trustworthy survey generation and tool-augmented reasoning.
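To make the sparse-attention idea behind InfLLM v2 concrete, here is a minimal NumPy sketch of block-sparse attention: keys are grouped into blocks, each block is scored by a pooled summary against the query, and attention is computed only over the top-k blocks. This is an illustration under simplifying assumptions (mean-pooled block summaries, a single query vector, per-query top-k); the actual InfLLM v2 scoring, training, and kernel design are not reproduced here.

```python
import numpy as np

def topk_block_attention(q, K, V, block_size=4, k_blocks=2):
    """Sparse attention sketch: attend only to the k_blocks key/value
    blocks whose mean-pooled summary scores highest against query q."""
    n, d = K.shape
    n_blocks = n // block_size
    # Reshape the (truncated) KV cache into contiguous blocks.
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    block_repr = Kb.mean(axis=1)                 # (n_blocks, d) block summaries
    scores = block_repr @ q                      # relevance score per block
    top = np.argsort(scores)[-k_blocks:]         # indices of top-k blocks
    Ks = Kb[top].reshape(-1, d)                  # gather only selected keys
    Vs = Vb[top].reshape(-1, d)
    att = Ks @ q / np.sqrt(d)                    # scaled dot-product scores
    att = np.exp(att - att.max())
    att /= att.sum()                             # softmax over selected keys only
    return att @ Vs

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8)); V = rng.normal(size=(16, 8))
out = topk_block_attention(rng.normal(size=8), K, V)
```

Because both prefilling and decoding touch only the selected blocks, compute and memory traffic scale with `k_blocks * block_size` rather than the full context length.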
📝 Abstract
This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollouts for load-balanced reinforcement learning and BitCPM, a data-efficient ternary LLM. Regarding inference systems, we propose CPM.cu, an inference framework that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Extensive evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with the model context protocol, clearly showcasing its broad usability.
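To illustrate what a ternary LLM like BitCPM involves, the sketch below quantizes a weight tensor to the codebook {-1, 0, +1} with a single floating-point scale, using the absmean scheme known from BitNet b1.58. The exact BitCPM quantization-aware training recipe and its combination with INT4 compression are not specified here, so this is an assumed, simplified illustration of the storage format only.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Map weights to ternary codes {-1, 0, +1} plus one scale factor.
    Uses an absmean scale (as in BitNet b1.58); BitCPM's exact recipe
    may differ, so treat this as an illustrative sketch."""
    scale = float(np.abs(w).mean()) + eps        # per-tensor scaling factor
    q = np.clip(np.round(w / scale), -1, 1)      # nearest ternary code
    return q.astype(np.int8), scale

def ternary_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from codes and scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = ternary_quantize(w)
w_hat = ternary_dequantize(q, s)
```

Each weight then needs under 2 bits of storage instead of 16, and matrix multiplies reduce to additions and subtractions of activations, which is what makes such models attractive for edge deployment.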