Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

📅 2025-04-11

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

To address the challenge of deploying large language models (LLMs) on mobile devices constrained by limited DRAM capacity, this paper proposes the first adaptive memory scheduling framework supporting modern non-ReLU LLMs. Our method introduces three core innovations: (1) proactive weight restructuring coupled with dynamic DRAM-Flash swapping; (2) cross-layer preloading, sparsity-aware self-distillation, and dynamic memory orchestration; and (3) an integrated pipeline combining proactive weight prediction, context-aware sparsity modeling, self-distillation calibration, and heterogeneous storage pipelining. Experiments demonstrate that our approach maintains inference accuracy within 1% of the full-parameter baseline while reducing DRAM footprint by over 40%, thereby significantly expanding the feasible scale of deployable LLMs on resource-constrained devices. The solution achieves Pareto-optimal trade-offs between inference performance and memory cost.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight DRAM-flash swapping and incorporates three novel techniques: (1) Cross-layer active weights preloading. It uses the activations from the current layer to predict the active weights of several subsequent layers, enabling computation and data loading to overlap, as well as facilitating large I/O transfers. (2) Sparsity-aware self-distillation. It adjusts the active weights to align with the dense-model output distribution, compensating for approximations introduced by contextual sparsity. (3) Active weight DRAM-flash swapping pipeline. It orchestrates the DRAM space allocation among the hot weight cache, preloaded active weights, and computation-involved weights based on available memory. Results show ActiveFlow achieves the performance-cost Pareto frontier compared to existing efficiency optimization methods.

Problem

Research questions and friction points this paper is trying to address.

Overcoming DRAM capacity limits for on-device LLM deployment

Enabling adaptive DRAM usage for modern LLM inference

Optimizing performance-cost tradeoff in mobile LLM frameworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-layer active weights preloading for overlapping computation

Sparsity-aware self-distillation aligns active weights

Active weight DRAM-flash swapping pipeline optimizes memory

🔎 Similar Papers

Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices