🤖 AI Summary
To address inefficient temporal modeling and the heavy computational overhead of redundant visual tokens in video-based multimodal large language models (VLMs), this paper proposes a dual-path architecture comprising a "temporal encoder" and a lightweight visual tokenizer. The temporal encoder combines learnable spatio-temporal pooling with sequential models such as Token Turing Machines to achieve structured temporal compression over long videos. The visual tokenizer maps each frame sequence into merely 32 highly discriminative visual tokens. This design breaks the conventional VLM reliance on thousands of tokens: at just 4B parameters, the model matches the video question-answering performance of a 34B baseline while accelerating inference by over 10× and substantially reducing GPU memory consumption. The core contribution is the first joint optimization of extreme video-representation compression and efficient temporal modeling, enabling scalable, high-performance video understanding without sacrificing accuracy.
📝 Abstract
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of a 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models such as Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
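To make the compression idea concrete, here is a minimal sketch of pooling a video's per-frame visual tokens down to a fixed budget of 32 tokens. This is a hypothetical illustration only: it uses fixed random queries with softmax-weighted pooling as a stand-in for the paper's *learnable* spatio-temporal pooling (in the real model the queries/weights are trained, and a sequential encoder such as a Token Turing Machine is another option); the tensor shapes, function name, and dimensions are all assumptions, not the paper's implementation.

```python
import numpy as np

def temporal_token_pool(frame_tokens: np.ndarray, num_out: int = 32) -> np.ndarray:
    """Compress per-frame visual tokens into a small fixed token set.

    frame_tokens: array of shape (T, N, D) — T frames, N tokens per
    frame, D channels. Returns (num_out, D).

    Illustrative stand-in: softmax-weighted pooling with fixed random
    queries; in the actual model the pooling is learnable.
    """
    T, N, D = frame_tokens.shape
    flat = frame_tokens.reshape(T * N, D)           # flatten space and time
    rng = np.random.default_rng(0)
    queries = rng.standard_normal((num_out, D))     # learnable in practice
    scores = queries @ flat.T / np.sqrt(D)          # (num_out, T*N) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over all T*N tokens
    return weights @ flat                           # (num_out, D) pooled tokens

# e.g. 8 frames × 576 tokens/frame → a fixed budget of 32 tokens
video = np.random.default_rng(1).standard_normal((8, 576, 64))
compressed = temporal_token_pool(video)
print(compressed.shape)  # (32, 64)
```

The point of the sketch is the shape change: however many frames are sampled, the language model only ever sees `num_out` visual tokens, which is what drives the reported efficiency gains over models that feed thousands of tokens per video.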