Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in long video understanding: the high computational cost of processing dense frame sequences and the limited adaptability of existing keyframe selection methods to diverse user queries. The authors propose Q-Gate, a training-free, plug-in framework built around what they describe as the first query-aware dynamic modality routing mechanism. Three lightweight expert streams process visual details, scene semantics, and subtitle-based narratives, respectively, while a gating mechanism driven by a large language model (LLM) assesses the query's intent and allocates attention weights among them. Because modalities irrelevant to a given query are muted, the selected frames carry a higher signal-to-noise ratio and the routing decision is interpretable. The method integrates CLIP features, heuristic scoring, and context alignment strategies, works with multiple multimodal LLM backbones, and outperforms state-of-the-art baselines on LongVideoBench and Video-MME.

📝 Abstract

Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, which select a keyframe subset, typically rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores. This "one-size-fits-all" paradigm frequently fails: visual-only metrics are ineffective for plot-driven narrative queries, while indiscriminately incorporating textual scores introduces severe "modal noise" for purely visual tasks. To break this bottleneck, we propose Q-Gate, a plug-and-play and training-free framework that treats keyframe selection as a dynamic modality routing problem. We decouple the retrieval process into three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and Contextual Alignment for subtitle-driven narratives. Crucially, Q-Gate introduces a Query-Modulated Gating Mechanism that leverages the in-context reasoning of an LLM to assess the query's intent and dynamically allocate attention weights across the experts. This mechanism intelligently activates necessary modalities while "muting" irrelevant ones, thereby maximizing the signal-to-noise ratio. Extensive experiments on LongVideoBench and Video-MME across multiple MLLM backbones demonstrate that Q-Gate substantially outperforms state-of-the-art baselines. By effectively suppressing modality-specific noise, it provides a robust, highly interpretable solution for scalable video reasoning.
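
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how expert scores are normalized, how the LLM's output is turned into weights, or how frames are finally picked. The sketch assumes min-max normalization per expert, a weighted sum as the fusion rule, and a plain top-k selection. The names `gate_weights`, `select_keyframes`, and the expert keys `visual`, `global`, and `context` are hypothetical, and the gate values are passed in by hand where Q-Gate would obtain them from an LLM.

```python
import numpy as np


def minmax(scores):
    """Rescale one expert's per-frame scores to [0, 1] (assumed normalization)."""
    x = np.asarray(scores, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return (x - x.min()) / span


def gate_weights(raw_gate):
    """Turn raw non-negative gate values into weights that sum to 1.

    `raw_gate` stands in for the LLM's judgment of query intent; in this
    sketch it is a hand-written dict, not a real LLM call.
    """
    names = sorted(raw_gate)
    vals = np.array([max(0.0, float(raw_gate[n])) for n in names])
    if vals.sum() == 0:
        vals = np.ones(len(names))  # fall back to uniform weights
    vals = vals / vals.sum()
    return {n: float(v) for n, v in zip(names, vals)}


def select_keyframes(expert_scores, raw_gate, k):
    """Fuse expert scores with the gate weights and return the top-k frames.

    A weight of 0 mutes an expert entirely, so its scores cannot add noise.
    Returns the selected frame indices in temporal order and the fused scores.
    """
    weights = gate_weights(raw_gate)
    n_frames = len(next(iter(expert_scores.values())))
    fused = np.zeros(n_frames)
    for name, scores in expert_scores.items():
        fused += weights.get(name, 0.0) * minmax(scores)
    top = np.argsort(-fused, kind="stable")[:k]
    return sorted(int(i) for i in top), fused


# Toy per-frame scores for a 4-frame video (made-up numbers).
expert_scores = {
    "visual": [0.1, 0.9, 0.2, 0.3],
    "global": [0.5, 0.4, 0.6, 0.1],
    "context": [0.9, 0.0, 0.1, 0.8],
}

# A purely visual query: only the visual expert is active.
visual_frames, _ = select_keyframes(
    expert_scores, {"visual": 1.0, "global": 0.0, "context": 0.0}, k=2
)

# A plot-driven query: only the subtitle-based expert is active.
narrative_frames, _ = select_keyframes(
    expert_scores, {"visual": 0.0, "global": 0.0, "context": 1.0}, k=2
)
```

With these toy numbers the two queries select different frames from the same video, which is the behavior the abstract contrasts with a static, query-independent fusion of scores.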

Problem

Research questions and friction points this paper is trying to address.

long video understanding

keyframe selection

multimodal noise

query-modulated

modality routing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-Modulated Gating

Multimodal Keyframe Selection

Dynamic Modality Routing