🤖 AI Summary
To address high latency caused by redundant model loading and fragmented execution across multi-stage mobile video–language tasks (e.g., retrieval, captioning, reasoning), this work proposes a modular reuse architecture. It decomposes billion-parameter multimodal models into lightweight, task-agnostic components—such as shared visual encoders and language decoders—that enable zero-redundancy loading and parallel execution. The method integrates modular decomposition, cross-task parameter sharing, lightweight runtime scheduling, and on-device storage–computation co-optimization. Evaluated on mainstream smartphones, it achieves 27–33% end-to-end acceleration with negligible accuracy degradation (Recall@1 ↓ ≤2.3%, CIDEr ↓ ≤1.5%). This work establishes the first systematic lightweight paradigm for multi-stage multimodal inference on mobile devices, significantly improving the performance–efficiency trade-off.
📝 Abstract
Recent advances in video-language models have enabled powerful applications such as video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loading and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and shares them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution than non-reuse baselines, with only a marginal performance drop ($\leq$ 2.3 points in Recall@1 for retrieval, $\leq$ 1.5 CIDEr for captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
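The reuse-centric idea described above can be sketched minimally: load each heavyweight module (visual encoder, language decoder) exactly once into a registry, then let every subtask in the pipeline borrow the same in-memory instance instead of reloading it. This is a hypothetical illustration of the concept only; all names here (`ModuleRegistry`, `load_visual_encoder`, etc.) are invented for the sketch and are not the paper's actual API.

```python
class ModuleRegistry:
    """Caches heavyweight modules so each is loaded at most once."""

    def __init__(self):
        self._cache = {}
        self.load_count = 0  # tracks how many expensive loads occurred

    def get(self, name, loader):
        # Load on first request only; later requests reuse the cached module.
        if name not in self._cache:
            self._cache[name] = loader()
            self.load_count += 1
        return self._cache[name]


def load_visual_encoder():
    return "visual-encoder-weights"  # stand-in for real model weights


def load_language_decoder():
    return "language-decoder-weights"


def run_pipeline(registry, subtasks):
    # Each subtask reuses the shared encoder/decoder rather than reloading.
    results = []
    for task in subtasks:
        enc = registry.get("visual_encoder", load_visual_encoder)
        dec = registry.get("language_decoder", load_language_decoder)
        results.append(f"{task}: ran with {enc} + {dec}")
    return results


registry = ModuleRegistry()
out = run_pipeline(registry, ["captioning", "retrieval", "reasoning"])
# Three subtasks executed, but each shared module was loaded only once.
assert registry.load_count == 2
```

In a non-reuse baseline, each of the three subtasks would trigger its own model load (six loads here instead of two), which is the redundancy the abstract attributes the latency savings to.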