xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 14
Influential: 0
🤖 AI Summary
To address the inefficient temporal modeling and excessive visual-token redundancy that drive up computational overhead in video multimodal language models (VLMs), this paper pairs a conventional per-frame visual tokenizer with a learnable "temporal encoder." The temporal encoder, instantiated either as learnable spatio-temporal attentional pooling or as a sequential model such as a Token Turing Machine, compresses the token sequence gathered over all frames into merely 32 visual tokens. This design breaks the conventional VLM reliance on thousands of tokens: at just 4B parameters, the model matches the video question-answering performance of 34B baselines while accelerating inference by over 10× and substantially reducing GPU memory consumption. The core contribution is the joint optimization of extreme video-token compression and efficient temporal modeling, enabling scalable, high-performance video understanding without sacrificing accuracy.
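The compression step described above can be sketched as Perceiver-style cross-attention pooling: a small set of learnable query vectors attends over all frame tokens and emits a fixed-size token set. This is a minimal single-head sketch with random (untrained) queries, not the paper's implementation; the shapes and names are illustrative.

```python
import numpy as np

def attention_pool(frame_tokens, queries):
    """Compress a (T*N, d) flattened sequence of frame tokens into (k, d)
    via cross-attention against k learnable query vectors.

    A minimal sketch of learnable attentional pooling; the paper's temporal
    encoder adds spatio-temporal structure on top of this idea.
    """
    d = frame_tokens.shape[-1]
    scores = queries @ frame_tokens.T / np.sqrt(d)       # (k, T*N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over tokens
    return weights @ frame_tokens                        # (k, d)

rng = np.random.default_rng(0)
T, N, d, k = 8, 128, 64, 32           # 8 frames x 128 tokens -> 32 tokens
tokens = rng.normal(size=(T * N, d))  # per-frame tokens, flattened over time
queries = rng.normal(size=(k, d))     # learnable queries (random stand-ins)
video_tokens = attention_pool(tokens, queries)
print(video_tokens.shape)             # (32, 64)
```

Whatever the frame count, the LLM downstream always sees exactly `k = 32` video tokens, which is what decouples sequence length from video length.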

📝 Abstract
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
Problem

Research questions and friction points this paper is trying to address.

Efficiently capture temporal information in videos
Reduce visual token count for video representation
Maintain accuracy with smaller, more efficient models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses temporal encoder for video token compression
Reduces visual tokens to 32, vs. 4608 in competing models
Achieves high accuracy with smaller 4B model
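The Token Turing Machine variant mentioned above can be sketched as a sequential read/write loop over a fixed-size token memory: each frame's tokens are merged with the memory and summarized back down, so the final memory doubles as the 32-token video representation. This is an illustrative sketch with a random importance-pooling summarizer standing in for the learned one; it is not the paper's implementation.

```python
import numpy as np

def summarize(tokens, k, rng):
    """Pool an arbitrary token set down to k tokens via importance-weighted
    attention. Stand-in for the learned token summarizer in a TTM."""
    d = tokens.shape[-1]
    w = rng.normal(size=(k, d))                      # random query stand-ins
    scores = w @ tokens.T / np.sqrt(d)               # (k, n_tokens)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)              # softmax over tokens
    return attn @ tokens                             # (k, d)

def ttm_encode(frames, mem_size, rng):
    """frames: (T, N, d). Keep a fixed-size token memory; at each step,
    read memory + current frame tokens, then write back a compressed memory."""
    T, N, d = frames.shape
    memory = np.zeros((mem_size, d))
    for t in range(T):
        combined = np.concatenate([memory, frames[t]], axis=0)
        memory = summarize(combined, mem_size, rng)  # write step
    return memory                                    # (mem_size, d)

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 128, 64))      # 8 frames, 128 tokens each
video_tokens = ttm_encode(video, 32, rng)  # sequential compression to 32 tokens
print(video_tokens.shape)                  # (32, 64)
```

Because the memory size is constant, compute per step stays flat as frames stream in, which is why this style of sequential encoder suits longer videos.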