🤖 AI Summary
To address the memory bottlenecks that hinder large language model (LLM) inference on resource-constrained edge devices, this paper proposes FlexInfer, a flexible and efficient on-device memory-offloading framework. Methodologically, it introduces three mechanisms: asynchronous CPU-to-GPU tensor prefetching, balanced memory locking, and flexible tensor preservation, integrated within a lightweight runtime scheduler that jointly optimizes memory efficiency and I/O performance. The framework dynamically adapts to user-specified hardware constraints, enabling the deployment of 7B–13B LLMs on edge devices with only 4GB of RAM. Experimental results demonstrate up to a 12.5× improvement in inference throughput over state-of-the-art offloading approaches, while achieving low latency and strong adaptability across diverse hardware configurations.
📝 Abstract
Large Language Models (LLMs) are challenging to run on-device because of their high memory demands. Traditional methods for reducing memory usage often compromise performance and lack adaptability. We propose FlexInfer, an optimized offloading framework for on-device inference that addresses these issues with asynchronous prefetching, balanced memory locking, and flexible tensor preservation. These strategies improve memory efficiency and mitigate I/O bottlenecks, ensuring high performance within user-specified resource constraints. Experiments demonstrate that FlexInfer significantly improves throughput under limited resources, achieving up to 12.5× better performance than existing methods and facilitating the deployment of large models on resource-constrained devices.
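The core idea behind asynchronous prefetching in offloading frameworks like the one described is to overlap host-to-device weight transfers with computation, so the GPU is not stalled waiting on I/O. The paper does not publish its implementation here, so the following is only a minimal sketch of the double-buffering pattern, with a background thread standing in for asynchronous copies and all names (`host_weights`, `prefetch_worker`, `run_inference`) being hypothetical:

```python
import queue
import threading
import time

# Hypothetical per-layer weights resident in host (CPU) memory.
# In a real offloading runtime these would be CPU- or disk-backed tensors.
host_weights = {i: f"weights_layer_{i}" for i in range(4)}

def prefetch_worker(layer_order, ready_q):
    """Asynchronously stage each layer's weights ahead of compute.

    The sleep stands in for a host-to-device copy (e.g. an async memcpy
    on a side stream); the bounded queue caps how far ahead we prefetch.
    """
    for layer_id in layer_order:
        time.sleep(0.01)  # simulated transfer latency
        ready_q.put((layer_id, host_weights[layer_id]))

def run_inference(num_layers=4):
    # maxsize=2 gives classic double buffering: at most one layer is
    # "on device" being computed while the next is in flight.
    ready = queue.Queue(maxsize=2)
    t = threading.Thread(
        target=prefetch_worker, args=(range(num_layers), ready), daemon=True
    )
    t.start()

    outputs = []
    for _ in range(num_layers):
        layer_id, weights = ready.get()  # blocks only if the copy lags compute
        # Compute for this layer overlaps the prefetch of the next one.
        outputs.append(f"computed_with_{weights}")
    t.join()
    return outputs

print(run_inference())
```

The bounded queue is the key design choice: it both limits device-memory pressure (only a fixed prefetch window is resident at once) and provides backpressure so the loader cannot run arbitrarily far ahead of the compute loop.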