MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

📅 2026-03-16
🤖 AI Summary
Deploying large language models on resource-constrained mobile devices presents a significant challenge: simultaneously achieving low latency, broad hardware compatibility, and strong model performance. This work proposes a hardware-in-the-loop neural architecture search method that co-designs model structure and attention patterns through latency-guided Pareto-front optimization, ensuring compatibility with standard mobile runtimes without requiring custom kernels. Weight inheritance makes the search affordable, and an attention-skipping mechanism accelerates long-context inference. The resulting MobileLLM-Flash series of models (350M–1.4B parameters) supports 8K context lengths and achieves up to 1.8× faster prefill and 1.6× faster decoding on mobile CPUs compared to prior approaches, while maintaining comparable or superior language generation quality.
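The attention-skipping idea can be illustrated with a toy forward pass. This is a hypothetical sketch, not the paper's implementation: all names (`forward`, `skip_attn`, etc.) are invented for illustration. The point is that a per-layer flag lets some layers bypass the attention sublayer entirely, avoiding its quadratic cost on long contexts while the feed-forward sublayer still runs.

```python
# Hypothetical sketch of attention skipping (names are illustrative, not from
# the paper): per-layer flags decide whether the attention sublayer executes,
# so skipped layers avoid the O(n^2) attention cost on long contexts while
# the feed-forward sublayer still runs.

def forward(x, layers):
    """x: hidden state; layers: list of dicts with 'attn', 'ffn', 'skip_attn'."""
    for layer in layers:
        if not layer["skip_attn"]:
            x = x + layer["attn"](x)   # attention sublayer (residual)
        x = x + layer["ffn"](x)        # feed-forward sublayer (residual)
    return x

# Toy demo: identity sublayers, so each executed sublayer doubles x.
toy_layers = [
    {"attn": lambda x: x, "ffn": lambda x: x, "skip_attn": False},
    {"attn": lambda x: x, "ffn": lambda x: x, "skip_attn": True},  # attention skipped
]
print(forward(1.0, toy_layers))  # → 8.0 (three executed sublayers: 1 → 2 → 4 → 8)
```

Which layers to skip is itself a searchable "attention pattern" dimension, alongside layer count and hidden dimensions.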

📝 Abstract
Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like ExecuTorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8K context length. MobileLLM-Flash delivers up to 1.8× faster prefill and 1.6× faster decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
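The second stage of the staged process, selecting the latency/quality Pareto frontier, can be sketched in a few lines. This is a minimal illustration under assumed data shapes, not the paper's search code: candidate names, the `latency_ms`/`quality` fields, and the toy numbers are all hypothetical. A candidate survives only if no other candidate is at least as good on both objectives and strictly better on one.

```python
# Illustrative Pareto-frontier filter over (latency, quality) candidates.
# All field names and numbers are hypothetical, not from the paper.

def pareto_frontier(candidates):
    """Keep candidates not dominated on (latency_ms: lower is better,
    quality: higher is better)."""
    frontier = []
    for c in candidates:
        dominated = any(
            o["latency_ms"] <= c["latency_ms"]
            and o["quality"] >= c["quality"]
            and (o["latency_ms"] < c["latency_ms"] or o["quality"] > c["quality"])
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c["latency_ms"])

candidates = [
    {"name": "A", "latency_ms": 40, "quality": 0.61},
    {"name": "B", "latency_ms": 55, "quality": 0.60},  # dominated by A
    {"name": "C", "latency_ms": 70, "quality": 0.68},
    {"name": "D", "latency_ms": 90, "quality": 0.66},  # dominated by C
]
print([c["name"] for c in pareto_frontier(candidates)])  # → ['A', 'C']
```

In the staged setup the abstract describes, `latency_ms` would come from a learned latency predictor rather than on-device measurement of every candidate, which is what makes evaluating a large candidate pool cheap.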
Problem

Research questions and friction points this paper is trying to address.

on-device LLM
latency constraints
mobile deployment
hardware compatibility
real-time AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-device LLM
latency-guided architecture search
attention skipping
hardware-in-the-loop
Pareto-frontier optimization