Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high end-to-end latency and poor energy efficiency of large multimodal models (LMMs) executed monolithically on heterogeneous SoCs in edge devices, this work proposes NANOMIND, a hardware-software co-designed modular inference framework. The approach decomposes LMMs into independently schedulable "building-block" submodules and dynamically offloads them to the NPU, GPU, or DSP based on computational capability and power constraints. It further introduces token-aware caching and CPU-bottleneck-avoidance mechanisms, integrated with low-bit kernels, a unified memory architecture, and system-level scheduling policies. Experimental results demonstrate a 42.3% reduction in energy consumption and an 11.2% decrease in GPU VRAM usage. On network-isolated micro-edge devices, the framework enables LLaVA-OneVision to operate for nearly 12 hours and supports LLaMA-3-8B speech interaction for up to 20.8 hours, a significant advance in efficient, long-duration multimodal inference at the edge.

📝 Abstract
Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware-software co-design inference framework that breaks LMMs into modular "bricks" (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be decomposed into modular components and scheduled to run on the most appropriate compute units, enabling module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly half a day and LLaMA-3-8B for voice interactions of up to 20.8 hours.
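The module-level dynamic offloading described in the abstract can be sketched as a greedy cost-based placement. Everything below, including the brick names, per-accelerator capability numbers, and the latency/energy cost weighting, is a hypothetical illustration of the general technique, not NANOMIND's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Brick:
    """One schedulable submodule of an LMM (hypothetical parameters)."""
    name: str
    flops: float    # estimated compute, in GFLOPs
    mem_mb: float   # working-set size, in MB

# Hypothetical accelerator capabilities: throughput, energy cost, free memory.
ACCELERATORS = {
    "NPU": {"gflops_per_s": 26.0, "j_per_gflop": 0.04, "free_mb": 2048},
    "GPU": {"gflops_per_s": 40.0, "j_per_gflop": 0.10, "free_mb": 4096},
    "DSP": {"gflops_per_s": 8.0,  "j_per_gflop": 0.02, "free_mb": 512},
}

def schedule(bricks, accelerators, energy_weight=0.5):
    """Greedily map each brick to the accelerator that minimizes a
    weighted latency/energy cost, skipping units that lack the free
    memory to hold the brick's working set."""
    placement = {}
    for b in bricks:
        best, best_cost = None, float("inf")
        for name, cap in accelerators.items():
            if cap["free_mb"] < b.mem_mb:
                continue  # does not fit on this unit
            latency = b.flops / cap["gflops_per_s"]     # seconds
            energy = b.flops * cap["j_per_gflop"]       # joules
            cost = (1 - energy_weight) * latency + energy_weight * energy
            if cost < best_cost:
                best, best_cost = name, cost
        placement[b.name] = best
        if best is not None:
            accelerators[best]["free_mb"] -= b.mem_mb   # reserve memory
    return placement
```

With these illustrative numbers, a large LLM decoder lands on the GPU (the only unit with enough memory) while the lighter encoders favor the more energy-efficient NPU. A real scheduler would also account for inter-module data movement and runtime power states.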
Problem

Research questions and friction points this paper is trying to address.

Optimizing multimodal model execution on resource-constrained small devices
Reducing energy consumption and memory usage through hardware-software co-design
Enabling battery-powered devices to run large models without connectivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular model decomposition into specialized bricks
Dynamic accelerator mapping for heterogeneous hardware
Token-aware buffer management reducing memory usage
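The token-aware buffer management bullet can be illustrated with a cache whose eviction is driven by a total token budget rather than entry count. The class name, API, and LRU eviction policy below are assumptions for illustration, not the paper's implementation:

```python
from collections import OrderedDict

class TokenAwareCache:
    """Cache per-module buffers (e.g. encoder features or KV state)
    keyed by a token-sequence identifier. Once the total number of
    cached tokens exceeds `token_budget`, least-recently-used entries
    are evicted until the cache fits again."""

    def __init__(self, token_budget):
        self.token_budget = token_budget
        self._entries = OrderedDict()  # key -> (num_tokens, payload)
        self.total_tokens = 0

    def put(self, key, num_tokens, payload):
        if key in self._entries:
            old_tokens, _ = self._entries.pop(key)
            self.total_tokens -= old_tokens
        self._entries[key] = (num_tokens, payload)
        self.total_tokens += num_tokens
        # Evict oldest entries until we are back under the token budget.
        while self.total_tokens > self.token_budget and self._entries:
            _, (evicted_tokens, _) = self._entries.popitem(last=False)
            self.total_tokens -= evicted_tokens

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key][1]
```

Budgeting by tokens rather than entries matters because multimodal inputs vary widely in length: one long vision-token sequence can cost as much memory as many short text prompts.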
Yilong Li
PhD, Stanford University
operating systems, distributed systems, datacenter computing, networking
Shuai Zhang
Amazon Web Services AI, USA
Yijing Zeng
University of Wisconsin – Madison
Hao Zhang
University of Wisconsin – Madison
Xinmiao Xiong
University of Wisconsin – Madison
Jingyu Liu
University of Wisconsin – Madison
Pan Hu
Uber, USA
Suman Banerjee
Department of CSE, IIT Jammu
Algorithmic Data Management, Social Network Analysis, Graph Theory and Graph Algorithms, Parameterized Complexity