Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

📅 2024-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding models rely heavily on large pre-trained image/video encoders, resulting in high computational overhead, substantial energy consumption, and slow inference. To address this, we propose the first encoder-free lightweight video–language understanding architecture, which directly processes raw video frames using only a 45M-parameter spatiotemporal alignment block (STAB). STAB integrates local spatiotemporal encoding, attention-guided spatial downsampling, and hierarchical temporal relation modeling—eliminating the need for external visual encoders. Compared to state-of-the-art methods, our approach reduces parameter count by over 6.5×. On open-domain video question answering benchmarks, it matches or surpasses Video-ChatGPT and Video-LLaVA in overall performance, demonstrates superior temporal reasoning capability, and achieves 3–4× faster inference. Ablation studies comprehensively validate both the efficacy of the encoder-free paradigm and the design principles of STAB.

📝 Abstract
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing - at least a 6.5× reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects like correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach while achieving 3-4× faster processing speeds than previous methods. Code is available at https://github.com/jh-yi/Video-Panda.
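One STAB ingredient named in the abstract, spatial downsampling through learned attention, can be sketched as cross-attention pooling: a small set of query vectors attends over all spatial tokens of a frame and compresses them into fewer tokens. The sketch below is a minimal, hedged illustration in numpy; the function name `attention_downsample`, the token/query dimensions, and the randomly initialized queries (standing in for learned parameters) are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_downsample(tokens, num_queries, rng):
    """Compress (T, D) spatial tokens into (num_queries, D) tokens via
    cross-attention pooling. Queries are random here as a stand-in for
    learned parameters (hypothetical, for illustration only)."""
    t, d = tokens.shape
    queries = rng.standard_normal((num_queries, d))
    scores = queries @ tokens.T / np.sqrt(d)   # (num_queries, T) attention logits
    weights = softmax(scores, axis=-1)         # each query's weights sum to 1
    return weights @ tokens                    # weighted pooling -> (num_queries, D)

rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((196, 64))  # e.g. 14x14 patch tokens, dim 64
pooled = attention_downsample(frame_tokens, 49, rng)
print(pooled.shape)  # (49, 64): 4x fewer spatial tokens per frame
```

Reducing 196 tokens to 49 per frame is what makes multi-frame processing cheap downstream: the language model sees 4x fewer visual tokens per frame, which is one route to the 3-4× inference speedup the abstract reports.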
Problem

Research questions and friction points this paper is trying to address.

Video Understanding
Computational Cost
Energy Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-Panda
STAB
Efficient Video Understanding