🤖 AI Summary
To address inefficient temporal modeling and the heavy computational overhead of redundant visual tokens in video-based multimodal large language models (VLMs), this paper proposes a dual-path architecture comprising a "temporal encoder" and a lightweight visual tokenizer. The temporal encoder combines learnable spatio-temporal pooling with sequential models such as Token Turing Machines to achieve structured temporal compression over long videos. The visual tokenizer maps each frame sequence into merely 32 highly discriminative visual tokens. This design breaks the conventional VLM reliance on thousands of tokens: at just 4B parameters, the model matches the video question-answering performance of a 34B baseline while accelerating inference by over 10× and substantially reducing GPU memory consumption. The core contribution is the first joint optimization of extreme video-representation compression and efficient temporal modeling, enabling scalable, high-performance video understanding without sacrificing accuracy.
📝 Abstract
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of a 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models such as Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
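To make the compression idea concrete, here is a minimal sketch of pooling a video's per-frame visual tokens down to a fixed budget of 32 tokens. This is a hypothetical illustration only: it uses fixed random queries with softmax-weighted pooling as a stand-in for the paper's *learnable* spatio-temporal pooling (in the real model the queries/weights are trained, and a sequential encoder such as a Token Turing Machine is another option); the tensor shapes, function name, and dimensions are all assumptions, not the paper's implementation.

```python
import numpy as np

def temporal_token_pool(frame_tokens: np.ndarray, num_out: int = 32) -> np.ndarray:
    """Compress per-frame visual tokens into a small fixed token set.

    frame_tokens: array of shape (T, N, D) — T frames, N tokens per
    frame, D channels. Returns (num_out, D).

    Illustrative stand-in: softmax-weighted pooling with fixed random
    queries; in the actual model the pooling is learnable.
    """
    T, N, D = frame_tokens.shape
    flat = frame_tokens.reshape(T * N, D)           # flatten space and time
    rng = np.random.default_rng(0)
    queries = rng.standard_normal((num_out, D))     # learnable in practice
    scores = queries @ flat.T / np.sqrt(D)          # (num_out, T*N) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over all T*N tokens
    return weights @ flat                           # (num_out, D) pooled tokens

# e.g. 8 frames × 576 tokens/frame → a fixed budget of 32 tokens
video = np.random.default_rng(1).standard_normal((8, 576, 64))
compressed = temporal_token_pool(video)
print(compressed.shape)  # (32, 64)
```

The point of the sketch is the shape change: however many frames are sampled, the language model only ever sees `num_out` visual tokens, which is what drives the reported efficiency gains over models that feed thousands of tokens per video.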