🤖 AI Summary
This work addresses the excessively long visual-token sequences produced by long videos, which sharply increase memory consumption and latency during inference. Existing compression methods are typically either weakly query-aware or apply a fixed compression policy across frames, which is suboptimal when visual evidence is unevenly distributed over time. The authors propose VideoRouter, a query-adaptive dual-router framework built on InternVL. A Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while an Image Router uses early LLM layers to score frame relevance, so that less relevant frames are compressed aggressively and critical evidence frames keep their fine-grained detail. The routers are trained on Video-QTR-10K (allocation-policy supervision) and Video-FLR-200K (frame-relevance supervision). On VideoMME, MLVU, and LongVideoBench, VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, with up to a 67.9% token reduction.
📝 Abstract
Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.
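To make the idea of budgeted evidence allocation concrete, the following is a minimal sketch of how a policy decision and per-frame relevance scores could be turned into per-frame token counts. It is not the authors' implementation: the function name `allocate_tokens`, the policy labels `"coverage"` and `"detail"`, the `low`/`high` token rates, and the `top_frac` cutoff are all hypothetical choices made for illustration, and the relevance scores are assumed to be given (in VideoRouter they would come from the Image Router).

```python
from typing import List


def allocate_tokens(
    scores: List[float],
    budget: int,
    policy: str,
    low: int = 16,          # hypothetical token count for a compressed frame
    high: int = 256,        # hypothetical token count for a high-detail frame
    top_frac: float = 0.25, # hypothetical share of frames eligible for detail
) -> List[int]:
    """Assign a visual-token count to each frame without exceeding `budget`."""
    n = len(scores)
    if n == 0:
        return []

    if policy == "coverage":
        # Broad temporal coverage: spread the budget evenly over all frames.
        return [min(high, budget // n)] * n

    if policy != "detail":
        raise ValueError(f"unknown policy: {policy!r}")

    # High-resolution preservation: every frame first gets the low rate ...
    alloc = [min(low, budget // n)] * n
    remaining = budget - sum(alloc)

    # ... then the most relevant frames are upgraded while budget remains.
    k = max(1, int(n * top_frac))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    for i in ranked:
        extra = min(high - alloc[i], remaining)
        if extra <= 0:
            break
        alloc[i] += extra
        remaining -= extra
    return alloc


if __name__ == "__main__":
    frame_scores = [0.1, 0.9, 0.2, 0.8]
    print(allocate_tokens(frame_scores, budget=400, policy="coverage"))
    print(allocate_tokens(frame_scores, budget=400, policy="detail"))
```

In this sketch the "coverage" branch treats all frames alike, whereas the "detail" branch concentrates tokens on the highest-scoring frames and leaves the rest at the low rate, which mirrors the trade-off the abstract describes without claiming to reproduce the paper's actual compression scheme.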