Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

📅 2025-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the memory and computational bottlenecks caused by token overload in long-video question answering, this paper proposes a query-adaptive framework for selecting static and dynamic frame tokens. Unlike generic compression strategies that ignore query-specific semantic requirements, the EXPLORE-THEN-SELECT framework uses query-frame attention to model how much static visual detail and dynamic motion information the input question needs, enabling fine-tuning-free, query-aware two-stage token allocation. The method is lightweight and plug-and-play, compatible with mainstream video-language models. Evaluated on multiple benchmarks, the approach improves QA accuracy by up to 5.8%.

📝 Abstract

Video question answering benefits from the rich information available in videos, enabling a wide range of applications. However, the large volume of tokens generated from longer videos presents significant challenges to memory efficiency and model performance. To alleviate this issue, existing works propose to compress video inputs, but they usually overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. To tackle this, we propose a novel token selection strategy, EXPLORE-THEN-SELECT, that adaptively adjusts the amount of static and dynamic information based on question requirements. Our framework first explores different token allocations between static frames, which preserve spatial details, and dynamic frames, which capture temporal changes. Next, it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our proposed framework is plug-and-play and can be seamlessly integrated into diverse video-language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) across various video question answering benchmarks.
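
The two stages described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the per-frame token costs, and the relevance lists (a stand-in for the query-aware attention metric) are all hypothetical, and the scoring rule (sum of the top relevance values) is a simplifying assumption chosen only to show the explore-then-select control flow.

```python
from typing import List, Tuple


def explore_allocations(budget: int, static_cost: int, dynamic_cost: int) -> List[Tuple[int, int]]:
    """Explore stage (toy): list (n_static, n_dynamic) frame counts that fit the token budget.

    static_cost and dynamic_cost are assumed tokens-per-frame values, not figures from the paper.
    """
    candidates = []
    for n_static in range(budget // static_cost + 1):
        remaining = budget - n_static * static_cost
        candidates.append((n_static, remaining // dynamic_cost))
    return candidates


def score_allocation(alloc: Tuple[int, int], static_rel: List[float], dynamic_rel: List[float]) -> float:
    """Score one allocation by summing the highest query-relevance values it can cover.

    The relevance lists stand in for a query-frame attention score (an assumption).
    """
    n_static, n_dynamic = alloc
    top_static = sorted(static_rel, reverse=True)[:n_static]
    top_dynamic = sorted(dynamic_rel, reverse=True)[:n_dynamic]
    return sum(top_static) + sum(top_dynamic)


def explore_then_select(budget: int, static_cost: int, dynamic_cost: int,
                        static_rel: List[float], dynamic_rel: List[float]) -> Tuple[int, int]:
    """Select stage (toy): pick the candidate allocation with the highest score; no model updates."""
    candidates = explore_allocations(budget, static_cost, dynamic_cost)
    return max(candidates, key=lambda a: score_allocation(a, static_rel, dynamic_rel))


# A question about fine spatial detail: static frames are far more relevant.
detail_query = explore_then_select(8, 4, 1, [0.9, 0.8, 0.1], [0.05] * 8)

# A question about motion: dynamic frames are more relevant.
motion_query = explore_then_select(8, 4, 1, [0.1, 0.1, 0.1], [0.3] * 8)
```

With these made-up numbers, the detail-oriented query spends the whole budget on static frames and the motion-oriented query spends it on dynamic frames, which is the query-dependent behaviour the abstract describes.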

Problem

Research questions and friction points this paper is trying to address.

Adaptive token selection for video QA based on query needs

Balancing static and dynamic video information efficiently

Improving memory efficiency and model performance in video QA

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive token selection based on question requirements

Query-aware attention-based metric for optimal tokens

Plug-and-play integration with video-language models

🔎 Similar Papers

Frame-Voyager: Learning to Query Frames for Video Large Language Models