Slow-Fast Architecture for Video Multi-Modal Large Language Models

πŸ“… 2025-04-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the bottleneck in video multimodal large language models (MLLMs)β€”where limited computational resources hinder simultaneous high temporal resolution and fine-grained spatial detailβ€”this paper proposes an instruction-aware slow-fast dual-path architecture. The fast path generates coarse global overview tokens, while the slow path employs a text-guided hybrid decoder for fine-grained visual feature extraction. Their synergy enables dynamic, instruction-driven feature selection and linear-complexity vision-language alignment. The architecture is plug-and-play, requiring no retraining to adapt existing video MLLMs. Experiments show that, with only a 3% increase in computational overhead, input frame count scales from 16 to 128, yielding an average 16% performance gain across five major video understanding benchmarks; the 7B variant achieves state-of-the-art results at its scale. The core innovation lies in the first integration of instruction awareness into slow-fast token design, enabling efficient and precise spatiotemporal joint modeling.

Technology Category

Application Category

πŸ“ Abstract
Balancing temporal resolution and spatial detail under limited compute budget remains a key challenge for video-based multi-modal large language models (MLLMs). Existing methods typically compress video representations using predefined rules before feeding them into the LLM, resulting in irreversible information loss and often ignoring input instructions. To address this, we propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details. Inspired by how humans first skim a video before focusing on relevant parts, our slow-fast design employs a dual-token strategy: 1)"fast"visual tokens -- a compact set of compressed video features -- are fed into the LLM alongside text embeddings to provide a quick overview; 2)"slow"visual tokens -- uncompressed video features -- are cross-attended by text embeddings through specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual details with linear complexity. We conduct systematic exploration to optimize both the overall architecture and key components. Experiments show that our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation, and achieving a 16% average performance improvement across five video understanding benchmarks. Our 7B model achieves state-of-the-art performance among models of similar size. Furthermore, our slow-fast architecture is a plug-and-play design that can be integrated into other video MLLMs to improve efficiency and scalability.
Problem

Research questions and friction points this paper is trying to address.

Balancing temporal resolution and spatial detail in video MLLMs
Reducing irreversible information loss in video representations
Enhancing instruction-aware extraction of relevant visual details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slow-fast dual-token strategy for video
Hybrid decoder layers for instruction-aware details
Plug-and-play design enhancing efficiency and scalability
πŸ”Ž Similar Papers
No similar papers found.