MobileLLM-Pro Technical Report

📅 2025-11-10
🤖 AI Summary
To address the joint challenges of low latency, long-context support, and high accuracy when deploying billion-parameter language models on mobile devices, this paper proposes an end-to-end efficient deployment framework. The authors introduce implicit positional distillation to preserve long-range dependency modeling; design a specialist model merging mechanism to enhance parameter efficiency; develop a utility-estimated data-mixing strategy to optimize the training distribution; and propose a 4-bit quantization-aware training method with self-distillation. Evaluated across 11 standard benchmarks, the approach consistently outperforms Gemma 3-1B and Llama 3.2-1B. It supports contexts up to 128K tokens and sustains near-lossless performance under 4-bit quantization (average degradation <0.5%), significantly advancing the practical deployability of billion-scale models on edge devices.
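As a rough illustration of the specialist-merging idea, fusing several domain experts that share one architecture can be done as a per-parameter weighted average, which keeps the parameter count fixed. The paper's actual merging framework is not detailed in this summary; the `merge_specialists` helper and uniform coefficients below are an illustrative sketch, not the authors' method:

```python
def merge_specialists(expert_weights, coeffs=None):
    """Fuse same-shaped specialist models into one model of identical size.

    expert_weights: list of flat parameter lists, one per domain expert.
    coeffs: optional per-expert mixing weights (defaults to a uniform
    average). Because the result has the same shape as each input,
    merging adds zero parameters -- the property the paper highlights.
    """
    n = len(expert_weights)
    if coeffs is None:
        coeffs = [1.0 / n] * n
    merged = []
    for params in zip(*expert_weights):       # iterate parameter-by-parameter
        merged.append(sum(c * p for c, p in zip(coeffs, params)))
    return merged

# Two toy "experts" with two parameters each; the merged model
# averages them and stays the same size.
fused = merge_specialists([[1.0, 2.0], [3.0, 4.0]])
```

Real merging schemes typically weight experts non-uniformly or merge in a task-vector space; the uniform average here only demonstrates the no-parameter-growth property.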

📝 Abstract
Efficient on-device language models around 1 billion parameters are essential for powering low-latency AI applications on mobile and wearable devices. However, achieving strong performance in this model class while supporting long context windows and practical deployment remains a significant challenge. We introduce MobileLLM-Pro, a 1-billion-parameter language model optimized for on-device deployment. MobileLLM-Pro achieves state-of-the-art results across 11 standard benchmarks, significantly outperforming both Gemma 3-1B and Llama 3.2-1B, while supporting context windows of up to 128,000 tokens and showing only minor performance regressions at 4-bit quantization. These improvements are enabled by four core innovations: (1) implicit positional distillation, a novel technique that effectively instills long-context capabilities through knowledge distillation; (2) a specialist model merging framework that fuses multiple domain experts into a compact model without parameter growth; (3) simulation-driven data mixing using utility estimation; and (4) 4-bit quantization-aware training with self-distillation. We release our model weights and code to support future research in efficient on-device language models.
Problem

Research questions and friction points this paper is trying to address.

Optimizing 1B parameter models for mobile deployment
Achieving strong performance with long context windows
Maintaining accuracy under 4-bit quantization constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit positional distillation for long-context capabilities
Specialist model merging framework without parameter growth
4-bit quantization-aware training with self-distillation
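The last innovation above can be sketched in miniature: quantization-aware training simulates 4-bit weights in the forward pass ("fake quantization"), while self-distillation trains the quantized student against the same model's full-precision outputs. The helpers below are a minimal, assumed formulation (symmetric per-tensor quantization, KL loss on output distributions); the paper's exact scheme may differ:

```python
import math

def fake_quant_4bit(weights, scale):
    """Simulate symmetric 4-bit quantization: round each weight to the
    nearest point on the 16-level signed grid [-8, 7], then map back to
    floating point. In QAT the forward pass uses these values while
    gradients update the full-precision weights (straight-through
    estimator, not shown here)."""
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-8, min(7, q))          # clamp to the signed 4-bit range
        out.append(q * scale)
    return out

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q), the self-distillation loss: it pulls the quantized
    student's output distribution q toward the full-precision teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy training signal: the same network plays teacher (full-precision
# weights) and student (fake-quantized weights); hypothetical logits.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.8, 0.7, -0.9]       # from the fake-quantized forward pass
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

Because teacher and student share one set of underlying weights, no separate teacher model needs to fit on the device during training, which is what makes the self-distillation variant attractive at this scale.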