🤖 AI Summary
Long-form video understanding is hindered by the limited context windows of multimodal large language models: dense visual streams saturate token budgets and cause critical information loss. To address this, this work proposes Tempo, a framework that employs a compact vision-language model as a local temporal compressor, generating concise, query-aligned video representations in a single forward pass via query-aware cross-modal distillation. Tempo further introduces a training-free Adaptive Token Allocation (ATA) mechanism that dynamically allocates bandwidth based on zero-shot semantic relevance, prioritizing query-related segments while preserving global narrative coherence through minimal anchor points. Evaluated on LVBench (a benchmark of 4101-second videos), Tempo scores 52.3 using only 8K visual tokens, surpassing GPT-4o and Gemini 1.5 Pro; performance improves to 53.7 when scaled to 2048 frames, while compressing hour-long videos to far below their theoretical token requirements.
📝 Abstract
Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-length LVBench (4101s), Tempo scores 52.3 under a strict 8K visual-token budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling the input to 2048 frames raises this to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, demonstrating that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
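The budget-allocation idea behind ATA can be illustrated with a small sketch. The code below is a hypothetical reconstruction, not the paper's implementation: function names, the greedy upgrade rule, and the default rates (16 tokens/frame dense, 0.5 tokens/frame anchor, matching the 0.5-16 range quoted above) are illustrative assumptions. It reserves minimal anchor tokens for every segment first, then greedily promotes the most query-relevant segments to the dense rate until the budget is exhausted.

```python
def allocate_tokens(relevance, n_frames, budget, dense=16, anchor=0.5):
    """Split a fixed visual-token budget across video segments.

    relevance: zero-shot query-relevance score per segment (higher = more relevant).
    n_frames:  number of frames in each segment.
    budget:    total visual tokens allowed (e.g. 8192 for an 8K budget).
    Returns a per-segment token allocation.
    """
    # Reserve minimal anchor tokens for every segment first, so the
    # global storyline survives even under aggressive compression.
    alloc = [anchor * f for f in n_frames]
    remaining = budget - sum(alloc)

    # Greedily upgrade the most query-relevant segments to the dense rate.
    order = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    for i in order:
        upgrade = (dense - anchor) * n_frames[i]
        if upgrade <= remaining:
            alloc[i] += upgrade
            remaining -= upgrade
    return alloc
```

Under this scheme a highly relevant segment receives the full dense allocation while background segments keep only their anchor tokens, so the total never exceeds the budget; a real router would additionally score relevance with the SVLM and could use intermediate rates rather than a binary dense/anchor split.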