Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Full-frame encoding in long-video understanding causes high memory overhead, redundant computation, and poor scalability. To address this, the paper proposes a query-driven “reasoning-first, perception-second” paradigm that closes the loop between perception and reasoning. Guided by the natural language query, a multimodal large language model performs semantic-adaptive frame sampling. Query-conditioned attention and feedback-based spatiotemporal alignment modules then restrict lightweight visual encoding to the critical spatiotemporal regions. Evaluated on five video question-answering benchmarks, including MSVD-QA, the method achieves state-of-the-art performance while reducing average frame input by 52%–73%, which improves inference efficiency and deployment scalability. The approach is presented as the first to jointly optimize query-guided temporal reasoning and sparse perception for efficient, adaptive, and scalable long-video understanding.

📝 Abstract
The rapid development of multimodal large language models (MLLMs) has significantly expanded the scope of vision-language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.

Problem

Research questions and friction points this paper is trying to address.

Addresses computational inefficiency in long-video understanding with MLLMs
Reduces excessive visual tokens and memory consumption from dense frame encoding
Replaces exhaustive analysis with query-guided adaptive perception for scalability
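
To make the token-cost argument concrete, the following back-of-the-envelope sketch computes how many visual tokens reach the language model with dense encoding versus the 52% and 73% frame reductions quoted in the summary. The clip length (1000 frames) and the per-frame token count (196) are hypothetical values chosen for illustration; neither comes from the paper.

```python
def frames_kept(total_frames: int, reduction_pct: int) -> int:
    """Frames still encoded after skipping `reduction_pct` percent of them."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_frames * (100 - reduction_pct) // 100


def visual_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens when every encoded frame costs the same."""
    return num_frames * tokens_per_frame


# Hypothetical values, NOT taken from the paper.
TOTAL_FRAMES = 1000
TOKENS_PER_FRAME = 196

dense = visual_tokens(TOTAL_FRAMES, TOKENS_PER_FRAME)
at_52 = visual_tokens(frames_kept(TOTAL_FRAMES, 52), TOKENS_PER_FRAME)
at_73 = visual_tokens(frames_kept(TOTAL_FRAMES, 73), TOKENS_PER_FRAME)

print(f"dense: {dense} tokens, 52% fewer frames: {at_52}, 73% fewer frames: {at_73}")
```

Under these assumptions the token count scales linearly with the number of encoded frames, so any reduction in frames translates directly into the same relative reduction in visual tokens.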

Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-driven adaptive perception for video understanding
Dynamic frame allocation reduces computational overhead
Lightweight framework achieves high performance with fewer frames
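
The general idea of query-driven frame allocation can be illustrated with a small coarse-to-fine sketch. This is not the Video-QTR algorithm: the paper uses an MLLM with query-conditioned attention and feedback-based alignment, whereas the sketch below substitutes plain cosine similarity between a query vector and frame embeddings. All names (`select_frames`, `toy_encoder`, `budget`, `stride`) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(
    query_vec: Vector,
    encode_frame: Callable[[int], Vector],
    num_frames: int,
    budget: int,
    stride: int,
) -> Dict[int, float]:
    """Encode at most `budget` frames, spending them where the query matches.

    Illustrative stand-in only: a coarse pass scores every `stride`-th frame,
    then the remaining budget is spent around the best coarse frames.
    """
    scores: Dict[int, float] = {}

    # Coarse pass: a sparse, uniform look at the whole video.
    for i in range(0, num_frames, stride):
        if len(scores) >= budget:
            return scores
        scores[i] = cosine(query_vec, encode_frame(i))

    # Feedback step: revisit the neighbourhoods of the best coarse frames,
    # expanding outward from each centre until the budget is exhausted.
    ranked: List[int] = sorted(scores, key=scores.get, reverse=True)
    for centre in ranked:
        for offset in range(1, stride):
            for i in (centre - offset, centre + offset):
                if len(scores) >= budget:
                    return scores
                if 0 <= i < num_frames and i not in scores:
                    scores[i] = cosine(query_vec, encode_frame(i))
    return scores


def toy_encoder(i: int) -> Vector:
    """Hypothetical encoder: frames 40-44 show the queried event."""
    return [0.0, 1.0] if 40 <= i <= 44 else [1.0, 0.0]


scores = select_frames([0.0, 1.0], toy_encoder, num_frames=100, budget=20, stride=10)
relevant = sorted(i for i, s in scores.items() if s > 0.5)
print(f"encoded {len(scores)} of 100 frames; relevant frames found: {relevant}")
```

In this toy setup only a fifth of the frames are ever encoded, yet the whole relevant segment is recovered because the coarse pass happens to land inside it. A real system would need a stronger relevance signal and a strategy for events that fall between coarse samples.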
Xinkui Zhao
School of Software Technology, Zhejiang University, Hangzhou, China
Zuxin Wang
School of Software Technology, Zhejiang University, Hangzhou, China
Yifan Zhang
School of Software Technology, Zhejiang University, Hangzhou, China
Guanjie Cheng
Assistant Professor, School of Software Technology, Zhejiang University
AIoT, Multi-Agent Collaboration, Edge Computing, Data Security and Blockchain, Privacy Protection
Yueshen Xu
Xidian University; Zhejiang University; UIC
Service Computing, Software Engineering, Software Service Engineering, Edge Computing
Shuiguang Deng
School of Computer Science, Zhejiang University, Hangzhou, China
Chang Liu
School of Computer Science, Zhejiang University, Hangzhou, China
Naibo Wang
School of Computer Science, Zhejiang University, Hangzhou, China
Jianwei Yin
Professor of Computer Science and Technology, Zhejiang University
Service Computing, Computer Architecture, Distributed Computing, AI