🤖 AI Summary
To address the joint challenges of low latency, long-context support, and high accuracy when deploying billion-parameter language models on mobile devices, this paper proposes an end-to-end deployment framework. We introduce implicit positional distillation to preserve long-range dependency modeling; design an expert-model merging mechanism to improve parameter efficiency; develop a utility-estimated data-mixing strategy to optimize the training distribution; and propose a 4-bit quantization-aware self-distillation training method. Evaluated across 11 standard benchmarks, our approach consistently outperforms Gemma 3-1B and Llama 3.2-1B. It supports contexts of up to 128K tokens and sustains near-lossless performance under 4-bit quantization (average degradation below 0.5%), significantly advancing the practical deployability of billion-scale models on edge devices.
📝 Abstract
Efficient on-device language models of around 1 billion parameters are essential for powering low-latency AI applications on mobile and wearable devices. However, achieving strong performance in this model class while supporting long context windows and practical deployment remains a significant challenge. We introduce MobileLLM-Pro, a 1-billion-parameter language model optimized for on-device deployment. MobileLLM-Pro achieves state-of-the-art results across 11 standard benchmarks, significantly outperforming both Gemma 3-1B and Llama 3.2-1B, while supporting context windows of up to 128,000 tokens and showing only minor performance regressions at 4-bit quantization. These improvements are enabled by four core innovations: (1) implicit positional distillation, a novel technique that effectively instills long-context capabilities through knowledge distillation; (2) a specialist model merging framework that fuses multiple domain experts into a compact model without parameter growth; (3) simulation-driven data mixing using utility estimation; and (4) 4-bit quantization-aware training with self-distillation. We release our model weights and code to support future research in efficient on-device language models.
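The abstract's innovation (2), merging domain specialists "without parameter growth", can be pictured with a minimal sketch. The paper's actual merging framework is not detailed here, so the sketch below assumes the simplest possible instance: weighted parameter averaging of same-architecture experts, which by construction keeps the parameter count fixed. All names (`merge_experts`, `coding_expert`, `math_expert`) are hypothetical.

```python
def merge_experts(experts, weights):
    """Merge same-architecture expert models by weighted parameter
    averaging -- a simple stand-in for the paper's specialist merging
    framework, which is not specified in this abstract.

    experts -- list of dicts mapping parameter name -> list of floats
    weights -- per-expert mixing coefficients (should sum to 1)
    """
    merged = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(weights, experts))
            for i in range(size)
        ]
    return merged

# Hypothetical toy experts: one parameter tensor each, identical shapes.
coding_expert = {"layer0.w": [1.0, 2.0]}
math_expert   = {"layer0.w": [3.0, 6.0]}

merged = merge_experts([coding_expert, math_expert], [0.5, 0.5])
# merged["layer0.w"] == [2.0, 4.0] -- same shape as each expert,
# so the merged model has no extra parameters.
```

Because the merged dict has exactly the keys and shapes of any single expert, this toy version illustrates the "compact model without parameter growth" property, though the real framework likely uses a more sophisticated combination rule.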
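Innovation (4), 4-bit quantization-aware training with self-distillation, can also be sketched. The idea is standard QAT: weights are "fake-quantized" (quantized to a 4-bit grid, then dequantized) during training so the model learns to tolerate quantization error, while a self-distillation loss lets the full-precision model supervise its quantized counterpart. The sketch below is a minimal pure-Python illustration of those two ingredients, not the paper's implementation; the function names and the symmetric per-tensor scheme are assumptions.

```python
import math

def fake_quant_4bit(weights):
    """Symmetric 4-bit fake quantization: map floats onto the integer
    grid [-8, 7] and dequantize back, exposing quantization error
    during (simulated) training. Per-tensor scale is assumed."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student): the full-precision model's output
    distribution supervises the quantized student's."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy example: quantize a small weight vector, then measure how far the
# quantized model's (here: the weights reused as toy logits) output
# drifts from the full-precision teacher's.
w = [0.7, -0.8, 0.1]
w_q = fake_quant_4bit(w)
loss = self_distillation_loss(w, w_q)  # small but nonzero KL gap
```

In a real training loop the KL term would be backpropagated (typically with a straight-through estimator through the rounding step), which is what lets the abstract claim only minor regressions at 4-bit precision.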