VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address excessive flash I/O overhead when deploying large vision-language models (VLMs) on edge devices, this paper proposes an I/O-aware, chunk-wise neuron sparsification method. Unlike conventional activation-magnitude-based neuron selection, the approach jointly models neuron importance and flash storage access cost, introducing a utility metric grounded in contiguous-access latency estimation to guide chunk-level sparsification decisions. Key techniques include memory-contiguity modeling, a lightweight I/O latency abstraction, utility-normalized neuron filtering, and an efficient loading mechanism. Evaluated on Jetson Orin Nano and AGX Orin platforms, the method achieves 4.65× and 5.76× I/O efficiency improvements over magnitude-based sparsity baselines, respectively. To the authors' knowledge, this is the first work to co-design activation sparsification with the underlying flash access patterns, establishing a practical I/O optimization paradigm for large-model inference at the edge.

📝 Abstract
Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
Problem

Research questions and friction points this paper is trying to address.

Optimizing activation sparsification for flash-based VLM edge deployment
Aligning neuron selection with storage access patterns to reduce I/O
Improving I/O efficiency through chunk-based sparsification and latency modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Groups neurons into memory chunks for sparsification
Models I/O latency using access contiguity abstraction
Selects chunks based on utility balancing importance and latency
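
The chunk-utility idea above can be sketched as follows. This is a hypothetical illustration, not the paper's exact algorithm: it assumes a simple linear flash latency model (a fixed per-request seek cost plus a per-byte transfer cost), and the constants `SEEK_US`, `US_PER_BYTE`, and `BYTES_PER_NEURON` are made-up placeholders.

```python
import numpy as np

# Assumed latency-model constants (illustrative only).
SEEK_US = 100.0          # fixed cost per flash read request (microseconds)
US_PER_BYTE = 0.001      # transfer cost per byte (microseconds)
BYTES_PER_NEURON = 4096  # e.g. one fp16 FFN row for hidden dim 2048

def chunk_utilities(importance: np.ndarray, chunk_size: int):
    """Group contiguous neurons into fixed-size chunks and score each chunk
    by utility = total neuron importance / estimated read latency."""
    n = len(importance)
    utilities = []
    for start in range(0, n, chunk_size):
        imp = importance[start:start + chunk_size].sum()
        nbytes = min(chunk_size, n - start) * BYTES_PER_NEURON
        # One contiguous read: pay the seek cost once, then stream the bytes.
        latency = SEEK_US + nbytes * US_PER_BYTE
        utilities.append((start, imp / latency))
    return utilities

def select_chunks(importance: np.ndarray, chunk_size: int, budget_us: float):
    """Greedily pick the highest-utility chunks until an I/O latency
    budget is exhausted; return the chosen chunk start offsets."""
    chosen, spent = [], 0.0
    for start, _util in sorted(chunk_utilities(importance, chunk_size),
                               key=lambda t: -t[1]):
        nbytes = min(chunk_size, len(importance) - start) * BYTES_PER_NEURON
        latency = SEEK_US + nbytes * US_PER_BYTE
        if spent + latency <= budget_us:
            chosen.append(start)
            spent += latency
    return sorted(chosen)
```

Because latency sits in the denominator, a chunk of mildly important but contiguous neurons can beat a scattered set of individually larger activations, which is the core departure from magnitude-only selection.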
Authors

Kichang Yang (Seoul National University)
Seonjun Kim (Seoul National University)
Minjae Kim (Seoul National University)
Nairan Zhang (Meta)
Chi Zhang (Amazon)
Youngki Lee (Seoul National University)
Mobile and ubiquitous computing