Micro Language Models Enable Instant Responses

📅 2026-04-21

🤖 AI Summary
This work addresses the challenge of deploying language models on resource-constrained edge devices, where limited compute and power budgets hinder on-device inference and purely cloud-based approaches incur multi-second latencies that degrade the user experience. To bridge this gap, the authors propose micro language models (μLMs), compact models with 8M–30M parameters that generate the first 4–8 words of a response directly on the device. A more powerful cloud model then extends this prefix within an asymmetric edge-cloud collaborative generation framework. The framework treats the cloud model as a continuator of the local prefix rather than as an independent respondent, and it adds three structured error-correction methods for cases where the local opener goes wrong, which allows the handoff to happen mid-sentence. Experiments show that μLMs match several existing 70M–256M-parameter models, indicating that collaboration across orders-of-magnitude differences in model scale is feasible and can mask cloud latency.

📝 Abstract
Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models ($μ$LMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured, graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that $μ$LMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
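
The handoff described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `local_opener`, `cloud_continue`, `respond`, the canned strings, and the 0.2 s simulated delay are all hypothetical stand-ins chosen for illustration. The sketch only shows the ordering of events: the on-device opener is emitted immediately, and the cloud model is asked to continue that opener rather than to answer from scratch. The error correction methods mentioned in the abstract are not modeled here.

```python
import asyncio
from typing import Callable


def local_opener(user_query: str) -> str:
    """Hypothetical stand-in for the on-device model: returns a short opener."""
    # A real model would condition on user_query; this stub returns a fixed prefix.
    return "Sure, the weather today is"


async def cloud_continue(user_query: str, prefix: str, latency_s: float = 0.2) -> str:
    """Hypothetical stand-in for the cloud model, asked to continue the prefix."""
    # Simulated network and inference delay (assumed value, not from the paper).
    await asyncio.sleep(latency_s)
    # A real continuator would receive both the query and the already-emitted prefix.
    return " sunny with a light breeze."


async def respond(user_query: str, emit: Callable[[str], None]) -> str:
    """Emit the local opener first, then the cloud continuation of that opener."""
    prefix = local_opener(user_query)
    emit(prefix)  # the user sees or hears this before the cloud has replied
    continuation = await cloud_continue(user_query, prefix)
    emit(continuation)
    return prefix + continuation


if __name__ == "__main__":
    asyncio.run(respond("What's the weather?", lambda chunk: print(chunk, end="", flush=True)))
    print()
```

In this sketch the user-perceived delay is the time to run `local_opener`, while the cloud round trip overlaps with the time the user spends reading or listening to the opener.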
Problem

Research questions and friction points this paper is trying to address.

edge devices
language models
latency
resource constraints
on-device inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

micro language models
on-device inference
collaborative generation
latency masking
asymmetric model collaboration