Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high latency caused by redundant model loading and fragmented execution across multi-stage mobile video–language tasks (e.g., retrieval, captioning, reasoning), this work proposes a modular reuse architecture. It decomposes billion-parameter multimodal models into lightweight, task-agnostic components—such as shared visual encoders and language decoders—that enable zero-redundancy loading and parallel execution. The method integrates modular decomposition, cross-task parameter sharing, lightweight runtime scheduling, and on-device storage–computation co-optimization. Evaluated on mainstream smartphones, it achieves 27–33% end-to-end acceleration with negligible accuracy degradation (Recall@1 ↓ ≤2.3%, CIDEr ↓ ≤1.5%). This work establishes the first systematic lightweight paradigm for multi-stage multimodal inference on mobile devices, significantly improving the performance–efficiency trade-off.
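The reuse-centric design described above can be illustrated with a minimal sketch: a registry that loads each shared module (visual encoder, language decoder) at most once and hands the cached instance to every subtask. All names and the registry API here are illustrative assumptions for exposition, not Atom's actual implementation.

```python
# Hypothetical sketch of zero-redundancy module loading: each shared
# component is loaded once and reused across pipeline stages.

class ModuleRegistry:
    """Loads each module (e.g. visual encoder, language decoder) at most
    once and returns the cached instance on every later request."""

    def __init__(self, loaders):
        self._loaders = loaders   # module name -> factory function
        self._cache = {}
        self.load_count = 0       # counts real loads actually performed

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
            self.load_count += 1
        return self._cache[name]


def run_pipeline(registry, subtasks):
    """Each subtask declares the modules it needs; shared modules come
    from the registry instead of being reloaded per stage."""
    results = []
    for task_name, needed in subtasks:
        modules = {m: registry.get(m) for m in needed}
        results.append((task_name, sorted(modules)))
    return results


registry = ModuleRegistry({
    "visual_encoder": lambda: object(),    # stand-in for a heavy model load
    "language_decoder": lambda: object(),
    "retrieval_head": lambda: object(),
})

# Three-stage video-language pipeline: every stage reuses the encoder.
pipeline = [
    ("captioning", ["visual_encoder", "language_decoder"]),
    ("reasoning", ["visual_encoder", "language_decoder"]),
    ("indexing", ["visual_encoder", "retrieval_head"]),
]
run_pipeline(registry, pipeline)
print(registry.load_count)  # 3 distinct loads, not 6 per-stage loads
```

The design choice is the one the summary attributes to Atom: the cost of loading a billion-parameter component is paid once per pipeline rather than once per stage.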


📝 Abstract
Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27–33% faster execution compared to non-reuse baselines, with only marginal performance drop (≤2.3 Recall@1 in retrieval, ≤1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant model loading in on-device video-language pipelines
Enables modular reuse across subtasks like captioning and reasoning
Minimizes execution latency while maintaining performance on mobile devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular reuse of billion-parameter model components
Eliminates repeated model loading through shared modules
Enables parallel execution to reduce latency on devices
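The parallel-execution point in the bullets above can be sketched as follows: once the shared visual encoder has produced features, independent subtasks such as captioning and indexing can run concurrently instead of back-to-back. Every function name below is an illustrative stand-in, not Atom's actual API; the sleeps model per-stage compute.

```python
# Hypothetical sketch: independent subtasks consuming shared encoder
# output run in parallel, so their latencies overlap.

from concurrent.futures import ThreadPoolExecutor
import time

def encode_video(frames):
    time.sleep(0.05)              # stand-in for one shared encoder pass
    return [f * 2 for f in frames]

def caption(features):
    time.sleep(0.05)              # stand-in for decoder compute
    return f"caption over {len(features)} features"

def build_index(features):
    time.sleep(0.05)              # stand-in for retrieval indexing
    return {i: v for i, v in enumerate(features)}

frames = [1, 2, 3]
features = encode_video(frames)   # encoder runs once; output is shared

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    cap_future = pool.submit(caption, features)
    idx_future = pool.submit(build_index, features)
    cap, idx = cap_future.result(), idx_future.result()
elapsed = time.perf_counter() - start

print(cap)
# elapsed is typically close to one stage's 0.05 s, not the 0.10 s
# a sequential schedule would pay, because the two sleeps overlap.
print(f"{elapsed:.3f}s")
```

On a real device the same overlap applies to heterogeneous resources (e.g. NPU-bound encoding versus CPU-bound indexing), which is where the reported end-to-end acceleration comes from.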