🤖 AI Summary
Long-form video understanding is hindered by the limited context windows of multimodal large language models: dense visual streams saturate token budgets and cause critical information loss. To address this, this work proposes Tempo, a framework that employs a compact vision-language model as a local temporal compressor, generating concise, query-aligned video representations in a single forward pass via query-aware cross-modal distillation. Tempo further introduces a training-free Adaptive Token Allocation (ATA) mechanism that dynamically allocates bandwidth based on zero-shot semantic relevance, prioritizing query-related segments while preserving global narrative coherence through minimal anchor points. Evaluated on LVBench (a benchmark of 4101-second videos), Tempo scores 52.3 using only 8K visual tokens, surpassing GPT-4o and Gemini 1.5 Pro; performance improves to 53.7 when scaled to 2048 frames, while compressing hour-long videos to far below their theoretical token requirements.
📝 Abstract
Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-length LVBench (4101s), Tempo scores 52.3 under a strict 8K visual-token budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling the input to 2048 frames raises this to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, demonstrating that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
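The budget-allocation idea behind ATA can be illustrated with a small sketch. The code below is a hypothetical reconstruction, not the paper's implementation: function names, the greedy upgrade rule, and the default rates (16 tokens/frame dense, 0.5 tokens/frame anchor, matching the 0.5-16 range quoted above) are illustrative assumptions. It reserves minimal anchor tokens for every segment first, then greedily promotes the most query-relevant segments to the dense rate until the budget is exhausted.

```python
def allocate_tokens(relevance, n_frames, budget, dense=16, anchor=0.5):
    """Split a fixed visual-token budget across video segments.

    relevance: zero-shot query-relevance score per segment (higher = more relevant).
    n_frames:  number of frames in each segment.
    budget:    total visual tokens allowed (e.g. 8192 for an 8K budget).
    Returns a per-segment token allocation.
    """
    # Reserve minimal anchor tokens for every segment first, so the
    # global storyline survives even under aggressive compression.
    alloc = [anchor * f for f in n_frames]
    remaining = budget - sum(alloc)

    # Greedily upgrade the most query-relevant segments to the dense rate.
    order = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    for i in order:
        upgrade = (dense - anchor) * n_frames[i]
        if upgrade <= remaining:
            alloc[i] += upgrade
            remaining -= upgrade
    return alloc
```

Under this scheme a highly relevant segment receives the full dense allocation while background segments keep only their anchor tokens, so the total never exceeds the budget; a real router would additionally score relevance with the SVLM and could use intermediate rates rather than a binary dense/anchor split.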