🤖 AI Summary
Deploying large language models (LLMs) on edge devices—such as smartphones—is hindered by their large memory footprint and high inference latency. To address these bottlenecks, we propose an on-device efficient inference framework featuring a novel DRAM-Flash hybrid memory architecture and a mobile CPU/GPU-coordinated dynamic weight-input reordering strategy. Our method tightly integrates multiple optimization techniques: post-training quantization, mixed-precision floating-point computation, multi-core load balancing, geometric computation optimization, and instruction-set-aware weight layout customization. These synergistic optimizations significantly improve hardware utilization and computational efficiency. Experimental results demonstrate up to an 8.6× speedup over state-of-the-art LLM inference frameworks, alongside substantial memory reduction. The framework enables real-time execution of mainstream open-weight models—including LLaMA and Phi—on both Android and iOS platforms.
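The DRAM-Flash hybrid idea can be illustrated with memory-mapped weight files: instead of loading all weights into DRAM up front, the file stays on flash and the OS pages data in on demand. The sketch below is illustrative only (a synthetic weight file and hypothetical names, not MNN-LLM's actual storage format or API):

```python
import mmap
import os
import tempfile

import numpy as np

# Write a synthetic weight file to disk (stand-in for a model checkpoint).
weights = np.arange(1024, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
weights.tofile(path)

# Memory-map the file: the OS pages weight data in from flash on demand,
# so resident DRAM grows only with the pages actually touched.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
view = np.frombuffer(mm, dtype=np.float32)

# Reading a slice faults in only the pages that back it.
chunk = view[256:512]
```

Here only the accessed slice is paged into DRAM; a real implementation would additionally manage which layers stay resident versus which are re-read from flash.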
📝 Abstract
Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge-device inference presents a promising solution. The primary challenges of edge inference are memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics, while employing strategies such as multi-core load balancing, mixed-precision floating-point operations, and geometric computations to enhance performance. Notably, MNN-LLM achieves up to an 8.6× speedup over current mainstream LLM-specific frameworks.
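To make the quantization component concrete, the sketch below shows symmetric per-group 4-bit post-training weight quantization, a common technique in this space. It is a minimal illustration under assumed conventions (group size 32, symmetric range [-7, 7], one fp32 scale per group), not MNN-LLM's actual quantization scheme:

```python
import numpy as np

def quantize_weights_int4(w, group_size=32):
    """Symmetric per-group 4-bit post-training quantization.

    Splits each weight row into groups, stores one fp32 scale per
    group, and maps weights to integers in [-7, 7].
    """
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    # One scale per group: the largest magnitude maps to the int4 max (7).
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales):
    """Recover approximate fp32 weights from int4 codes and scales."""
    groups = q.astype(np.float32) * scales
    return groups.reshape(groups.shape[0], -1)

# Example: quantize a random weight matrix and check the reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, s = quantize_weights_int4(w)
w_hat = dequantize_int4(q, s)
max_err = float(np.abs(w - w_hat).max())
```

Per-group scales bound the rounding error by half a quantization step within each group, which is why grouped schemes tolerate outlier weights far better than a single per-tensor scale.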