🤖 AI Summary
To address the memory bottlenecks that hinder large language model (LLM) inference on resource-constrained edge devices, this paper proposes FlexInfer, a flexible and efficient on-device memory-offloading framework. Methodologically, it introduces three mechanisms: asynchronous CPU-to-GPU tensor prefetching, balanced memory locking, and flexible tensor preservation, integrated within a lightweight runtime scheduler that jointly optimizes memory efficiency and I/O performance. The framework dynamically adapts to user-specified hardware constraints, enabling the deployment of 7B–13B LLMs on edge devices with only 4GB of RAM. Experimental results demonstrate up to a 12.5× improvement in inference throughput over state-of-the-art offloading approaches, while achieving low latency and strong adaptability across diverse hardware configurations.
📝 Abstract
Large Language Models (LLMs) are challenging to run on-device because of their high memory demands. Traditional methods for reducing memory usage often compromise performance and lack adaptability. We propose FlexInfer, an optimized offloading framework for on-device inference that addresses these issues with asynchronous prefetching, balanced memory locking, and flexible tensor preservation. These strategies improve memory efficiency and mitigate I/O bottlenecks, ensuring high performance within user-specified resource constraints. Experiments demonstrate that FlexInfer significantly improves throughput under limited resources, achieving up to 12.5× better performance than existing methods and facilitating the deployment of large models on resource-constrained devices.
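The core idea behind asynchronous prefetching in offloading frameworks like the one described is to overlap host-to-device weight transfers with computation, so the GPU is not stalled waiting on I/O. The paper does not publish its implementation here, so the following is only a minimal sketch of the double-buffering pattern, with a background thread standing in for asynchronous copies and all names (`host_weights`, `prefetch_worker`, `run_inference`) being hypothetical:

```python
import queue
import threading
import time

# Hypothetical per-layer weights resident in host (CPU) memory.
# In a real offloading runtime these would be CPU- or disk-backed tensors.
host_weights = {i: f"weights_layer_{i}" for i in range(4)}

def prefetch_worker(layer_order, ready_q):
    """Asynchronously stage each layer's weights ahead of compute.

    The sleep stands in for a host-to-device copy (e.g. an async memcpy
    on a side stream); the bounded queue caps how far ahead we prefetch.
    """
    for layer_id in layer_order:
        time.sleep(0.01)  # simulated transfer latency
        ready_q.put((layer_id, host_weights[layer_id]))

def run_inference(num_layers=4):
    # maxsize=2 gives classic double buffering: at most one layer is
    # "on device" being computed while the next is in flight.
    ready = queue.Queue(maxsize=2)
    t = threading.Thread(
        target=prefetch_worker, args=(range(num_layers), ready), daemon=True
    )
    t.start()

    outputs = []
    for _ in range(num_layers):
        layer_id, weights = ready.get()  # blocks only if the copy lags compute
        # Compute for this layer overlaps the prefetch of the next one.
        outputs.append(f"computed_with_{weights}")
    t.join()
    return outputs

print(run_inference())
```

The bounded queue is the key design choice: it both limits device-memory pressure (only a fixed prefetch window is resident at once) and provides backpressure so the loader cannot run arbitrarily far ahead of the compute loop.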