SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To balance model efficiency and performance for lightweight video large language models (V-LLMs) in long-video understanding, particularly for mobile deployment, this paper introduces the first lightweight V-LLM family (1B–3B parameters) built around a SlowFast dual-stream visual design. The method incorporates the SlowFast architecture into the V-LLM training pipeline, combined with multi-stage alignment, joint video-image fine-tuning, and LLaVA-1.5-style lightweight adaptation. The contributions are threefold: (1) it demonstrates, for the first time, that a 1B-parameter model achieves state-of-the-art performance on long-video benchmarks such as LongVideoBench and MLVU; (2) it significantly improves token efficiency and inference speed, enabling practical mobile deployment; and (3) the 1B and 3B variants consistently outperform competitors of comparable scale, while the 7B variant shows robust cross-modal generalization.

📝 Abstract
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.
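The token savings of the two-stream design come from trading temporal coverage against spatial detail: a slow pathway keeps a few frames at full spatial resolution, while a fast pathway keeps every frame but pools its spatial tokens aggressively. The following is a minimal NumPy sketch of that idea, not the paper's actual implementation; the function name, strides, and pooling factors are illustrative assumptions.

```python
import numpy as np

def slowfast_tokens(frames, slow_stride=4, fast_pool=4):
    """Illustrative sketch of two-stream SlowFast token reduction.

    frames: (T, H, W, C) array of per-frame visual features.
    Slow pathway: every `slow_stride`-th frame at full spatial resolution.
    Fast pathway: all frames, average-pooled over fast_pool x fast_pool patches.
    (Hypothetical parameters; the paper's exact configuration may differ.)
    """
    T, H, W, C = frames.shape
    # Slow pathway: few frames, all H*W spatial tokens kept.
    slow = frames[::slow_stride].reshape(-1, C)
    # Fast pathway: all T frames, spatial tokens reduced by fast_pool**2.
    Hp, Wp = H // fast_pool, W // fast_pool
    fast = frames[:, :Hp * fast_pool, :Wp * fast_pool, :]
    fast = fast.reshape(T, Hp, fast_pool, Wp, fast_pool, C).mean(axis=(2, 4))
    fast = fast.reshape(-1, C)
    # Concatenated token sequence passed to the LLM.
    return np.concatenate([slow, fast], axis=0)
```

For example, with 32 frames of 24x24 feature maps, the naive token count is 32 * 576 = 18432, while this sketch yields 8 * 576 + 32 * 36 = 5760 tokens, roughly a 3x reduction.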
Problem

Research questions and friction points this paper is trying to address.

Token-efficient models for long-form video understanding
Development of lightweight, mobile-friendly Video LLMs
Competitive performance across diverse video benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stream SlowFast mechanism for token efficiency
Token-efficient long-form video understanding
Streamlined training pipeline with a curated, high-quality data mixture