🤖 AI Summary
To address high latency caused by redundant model loading and fragmented execution across multi-stage mobile video–language tasks (e.g., retrieval, captioning, reasoning), this work proposes a modular reuse architecture. It decomposes billion-parameter multimodal models into lightweight, task-agnostic components—such as shared visual encoders and language decoders—that enable zero-redundancy loading and parallel execution. The method integrates modular decomposition, cross-task parameter sharing, lightweight runtime scheduling, and on-device storage–computation co-optimization. Evaluated on mainstream smartphones, it achieves 27–33% end-to-end acceleration with negligible accuracy degradation (Recall@1 ↓ ≤2.3%, CIDEr ↓ ≤1.5%). This work establishes the first systematic lightweight paradigm for multi-stage multimodal inference on mobile devices, significantly improving the performance–efficiency trade-off.
📝 Abstract
Recent advances in video-language models have enabled powerful applications such as video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loading and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and shares them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution than non-reuse baselines, with only a marginal performance drop ($\leq$ 2.3 points in Recall@1 for retrieval, $\leq$ 1.5 CIDEr for captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
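The reuse-centric idea described above can be sketched minimally: load each heavyweight module (visual encoder, language decoder) exactly once into a registry, then let every subtask in the pipeline borrow the same in-memory instance instead of reloading it. This is a hypothetical illustration of the concept only; all names here (`ModuleRegistry`, `load_visual_encoder`, etc.) are invented for the sketch and are not the paper's actual API.

```python
class ModuleRegistry:
    """Caches heavyweight modules so each is loaded at most once."""

    def __init__(self):
        self._cache = {}
        self.load_count = 0  # tracks how many expensive loads occurred

    def get(self, name, loader):
        # Load on first request only; later requests reuse the cached module.
        if name not in self._cache:
            self._cache[name] = loader()
            self.load_count += 1
        return self._cache[name]


def load_visual_encoder():
    return "visual-encoder-weights"  # stand-in for real model weights


def load_language_decoder():
    return "language-decoder-weights"


def run_pipeline(registry, subtasks):
    # Each subtask reuses the shared encoder/decoder rather than reloading.
    results = []
    for task in subtasks:
        enc = registry.get("visual_encoder", load_visual_encoder)
        dec = registry.get("language_decoder", load_language_decoder)
        results.append(f"{task}: ran with {enc} + {dec}")
    return results


registry = ModuleRegistry()
out = run_pipeline(registry, ["captioning", "retrieval", "reasoning"])
# Three subtasks executed, but each shared module was loaded only once.
assert registry.load_count == 2
```

In a non-reuse baseline, each of the three subtasks would trigger its own model load (six loads here instead of two), which is the redundancy the abstract attributes the latency savings to.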