Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive, token-by-token decoding makes context-aware generation with large language models (LLMs) slow and hardware-inefficient on mobile devices. To address this, the paper proposes the first co-optimization framework tailored to mobile platforms, integrating speculative decoding with dynamic hardware scheduling. The design combines a context-aligned lightweight draft model, an adaptive computation-graph scheduler, online task calibration, and intermediate-state reuse to enable parallelized generation and efficient resource utilization. Experiments across multiple smartphones and representative workloads demonstrate up to a 3.8× end-to-end generation speedup and a 4.7× energy-efficiency improvement, and ablation studies quantitatively attribute the gains to the individual components. This work establishes a systematic solution for efficient, context-aware LLM inference under severe resource constraints.
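
Because the summary centers on speculative decoding, a minimal sketch of the generic draft-and-verify loop that such frameworks build on may help. This is an assumed, greedy-decoding simplification; the model interfaces and function names are hypothetical, not CoordGen's actual API.

```python
# Hypothetical sketch of the draft-and-verify loop behind speculative
# decoding (greedy variant). Interfaces are simplified assumptions.
from typing import Callable, List

Token = int
# A model maps a token prefix to its greedy next token.
Model = Callable[[List[Token]], Token]

def speculative_decode(
    target: Model,        # large, accurate model (slow per call)
    draft: Model,         # lightweight draft model (fast per call)
    prompt: List[Token],
    max_new_tokens: int = 32,
    draft_len: int = 4,   # tokens speculated per round
) -> List[Token]:
    seq = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1. Draft: propose draft_len tokens autoregressively with the cheap model.
        proposal, ctx = [], list(seq)
        for _ in range(draft_len):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Verify: check each proposed token against the target model. In a
        #    real system this is one batched forward pass, which is what
        #    recovers parallelism during the memory-bound decode phase.
        accepted = 0
        for i, t in enumerate(proposal):
            if target(seq + proposal[:i]) == t:
                accepted += 1
            else:
                break
        seq.extend(proposal[:accepted])
        generated += accepted
        if accepted < draft_len and generated < max_new_tokens:
            # 3. On the first mismatch, keep the target model's own token, so
            #    output matches plain autoregressive decoding exactly.
            seq.append(target(seq))
            generated += 1
    return seq[:len(prompt) + max_new_tokens]
```

The speedup comes from step 2: verifying several tokens at once trades many small, bandwidth-bound decode steps for fewer, wider ones that mobile accelerators handle well.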

📝 Abstract
Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents CoordGen, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8× in generation speed and 4.7× in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
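
The abstract's second component calibrates drafting to the current task online. As a purely illustrative assumption of what "lightweight calibration" could look like, the sketch below adapts the speculation length to the acceptance rate observed over recent rounds; the update rule and all constants are invented, not taken from the paper.

```python
# Hypothetical online calibrator: adapt how many tokens are speculated per
# round to the acceptance rate seen on the current task. Constants and the
# update rule are assumptions for illustration only.
class DraftLengthCalibrator:
    def __init__(self, init_len: int = 4, min_len: int = 1, max_len: int = 8):
        self.draft_len = init_len
        self.min_len = min_len
        self.max_len = max_len
        self._ema = 0.75      # smoothed acceptance-rate estimate
        self._alpha = 0.2     # EMA update weight

    def update(self, accepted: int, proposed: int) -> int:
        """Fold one round's acceptance ratio into the estimate, then adjust."""
        rate = accepted / max(proposed, 1)
        self._ema = (1 - self._alpha) * self._ema + self._alpha * rate
        if self._ema > 0.8 and self.draft_len < self.max_len:
            self.draft_len += 1   # drafts mostly accepted: speculate deeper
        elif self._ema < 0.5 and self.draft_len > self.min_len:
            self.draft_len -= 1   # verification work wasted: speculate less
        return self.draft_len
```
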
Problem

Research questions and friction points this paper is trying to address.

Accelerating slow token-by-token generation on mobile LLMs
Improving hardware utilization for memory-bound mobile inference
Enhancing context-aware text generation efficiency on devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid speculative decoding with dynamic hardware scheduling
Adaptive execution scheduling balancing prefill and decoding (see the sketch after this list)
Context-aligned drafting with lightweight online calibration
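
As a rough illustration of the scheduling idea in the items above, the sketch below selects a per-phase compute-graph configuration: prefill is compute-bound and batches well, plain decode is memory-bound, and speculative verification restores enough batch parallelism to justify a wide graph again. Every threshold and configuration here is an invented placeholder, not CoordGen's actual scheduler.

```python
# Hypothetical phase-aware graph selection. Configurations are invented
# placeholders; the paper's scheduler is not described at this level.
from dataclasses import dataclass

@dataclass(frozen=True)
class GraphConfig:
    batch_tokens: int   # tokens processed per forward pass
    use_npu: bool       # offload the heavy matmuls to the neural processor

PREFILL_CFG = GraphConfig(batch_tokens=128, use_npu=True)   # compute-bound
DECODE_CFG = GraphConfig(batch_tokens=1, use_npu=False)     # memory-bound
VERIFY_CFG = GraphConfig(batch_tokens=8, use_npu=True)      # batched verification

def pick_config(phase: str, speculative: bool) -> GraphConfig:
    """Return the compute-graph configuration to run for the current phase."""
    if phase == "prefill":
        return PREFILL_CFG
    if phase == "decode" and speculative:
        return VERIFY_CFG   # verifying a draft batch pays for the wide graph
    return DECODE_CFG
```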