🤖 AI Summary
Deploying large language models (LLMs) on edge devices—such as smartphones—is hindered by their large memory footprint and high inference latency. To address these bottlenecks, we propose an on-device efficient inference framework featuring a novel DRAM-Flash hybrid memory architecture and a mobile CPU/GPU-coordinated dynamic weight-input reordering strategy. Our method tightly integrates multiple optimization techniques: post-training quantization, mixed-precision floating-point computation, multi-core load balancing, geometric computation optimization, and instruction-set-aware weight layout customization. These synergistic optimizations significantly improve hardware utilization and computational efficiency. Experimental results demonstrate up to an 8.6× speedup over state-of-the-art LLM inference frameworks, alongside substantial memory reduction. The framework enables real-time execution of mainstream open-weight models—including LLaMA and Phi—on both Android and iOS platforms.
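The DRAM-Flash hybrid idea can be illustrated with memory-mapped weight files: instead of loading all weights into DRAM up front, the file stays on flash and the OS pages data in on demand. The sketch below is illustrative only (a synthetic weight file and hypothetical names, not MNN-LLM's actual storage format or API):

```python
import mmap
import os
import tempfile

import numpy as np

# Write a synthetic weight file to disk (stand-in for a model checkpoint).
weights = np.arange(1024, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
weights.tofile(path)

# Memory-map the file: the OS pages weight data in from flash on demand,
# so resident DRAM grows only with the pages actually touched.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
view = np.frombuffer(mm, dtype=np.float32)

# Reading a slice faults in only the pages that back it.
chunk = view[256:512]
```

Here only the accessed slice is paged into DRAM; a real implementation would additionally manage which layers stay resident versus which are re-read from flash.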
📝 Abstract
Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge-device inference presents a promising solution. The primary challenges of edge inference are memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics, while employing strategies such as multi-core load balancing, mixed-precision floating-point operations, and geometric computations to enhance performance. Notably, MNN-LLM achieves up to an 8.6× speedup over current mainstream LLM-specific frameworks.
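To make the quantization component concrete, the sketch below shows symmetric per-group 4-bit post-training weight quantization, a common technique in this space. It is a minimal illustration under assumed conventions (group size 32, symmetric range [-7, 7], one fp32 scale per group), not MNN-LLM's actual quantization scheme:

```python
import numpy as np

def quantize_weights_int4(w, group_size=32):
    """Symmetric per-group 4-bit post-training quantization.

    Splits each weight row into groups, stores one fp32 scale per
    group, and maps weights to integers in [-7, 7].
    """
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    # One scale per group: the largest magnitude maps to the int4 max (7).
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales):
    """Recover approximate fp32 weights from int4 codes and scales."""
    groups = q.astype(np.float32) * scales
    return groups.reshape(groups.shape[0], -1)

# Example: quantize a random weight matrix and check the reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, s = quantize_weights_int4(w)
w_hat = dequantize_int4(q, s)
max_err = float(np.abs(w - w_hat).max())
```

Per-group scales bound the rounding error by half a quantization step within each group, which is why grouped schemes tolerate outlier weights far better than a single per-tensor scale.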