See What You Need: Query-Aware Visual Intelligence through Reasoning-Perception Loops

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-video question answering, existing methods rigidly decouple reasoning from perception, leading to premature visual abstraction that either discards critical information or introduces computational redundancy, and thus hindering query-adaptive extraction of salient visual evidence. This paper proposes a dynamic closed-loop coordination framework that, for the first time, enables adaptive reasoning-guided perception. It establishes a multi-round reasoning-perception interaction loop via hierarchical reasoning-driven visual localization, cross-modal semantic bridging, and confidence-guided iterative synthesis. Crucially, the framework is training-free, integrating semantic alignment with dynamic frame selection. Evaluated on EgoSchema (65.7%), NExT-QA (76.1%), and IntentQA (73.8%), it achieves state-of-the-art performance, significantly improving both accuracy and efficiency in long-video understanding.

📝 Abstract
Human video comprehension demonstrates dynamic coordination between reasoning and visual attention, adaptively focusing on query-relevant details. However, current long-form video question answering systems employ rigid pipelines that decouple reasoning from perception, leading to either information loss through premature visual abstraction or computational inefficiency through exhaustive processing. The core limitation lies in the inability to adapt visual extraction to specific reasoning requirements: different queries demand fundamentally different visual evidence from the same video content. In this work, we present CAVIA, a training-free framework that revolutionizes video understanding through reasoning-perception coordination. Unlike conventional approaches where visual processing operates independently of reasoning, CAVIA creates a closed-loop system where reasoning continuously guides visual extraction based on identified information gaps. CAVIA introduces three innovations: (1) hierarchical reasoning-guided localization to precise frames; (2) cross-modal semantic bridging for targeted extraction; (3) confidence-driven iterative synthesis. CAVIA achieves state-of-the-art performance on challenging benchmarks: EgoSchema (65.7%, +5.3%), NExT-QA (76.1%, +2.6%), and IntentQA (73.8%, +6.9%), demonstrating that dynamic reasoning-perception coordination provides a scalable paradigm for video understanding.
Problem

Research questions and friction points this paper is trying to address.

Adapting visual extraction to specific reasoning requirements
Overcoming rigid decoupling of reasoning and perception
Enabling query-aware dynamic visual attention in videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning-perception closed-loop coordination system
Hierarchical reasoning for precise frame localization
Confidence-driven iterative cross-modal synthesis
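The closed-loop coordination described above can be sketched as a simple control loop: the reasoner answers from the evidence gathered so far, and whenever its confidence falls below a threshold it names an information gap, which drives frame localization and targeted extraction for the next round. This is a minimal illustrative sketch only; the function names (`reason`, `localize`, `perceive`), the confidence threshold, and the round limit are hypothetical placeholders, not the paper's actual API.

```python
def reasoning_perception_loop(video_frames, question, reason, localize, perceive,
                              conf_threshold=0.8, max_rounds=3):
    """Iterate reasoning and perception until the reasoner is confident.

    reason(question, evidence)   -> (answer, confidence, info_gap)
    localize(frames, info_gap)   -> candidate frames relevant to the gap
    perceive(frames, info_gap)   -> textual evidence extracted from frames
    (All three callables are hypothetical stand-ins for the paper's modules.)
    """
    evidence = []
    answer, confidence = None, 0.0
    for _ in range(max_rounds):
        # 1. Reason over current evidence; report answer, confidence, info gap.
        answer, confidence, info_gap = reason(question, evidence)
        if confidence >= conf_threshold:
            break  # enough visual evidence has been gathered
        # 2. Hierarchical localization: map the information gap to frames.
        frames = localize(video_frames, info_gap)
        # 3. Cross-modal bridging: extract targeted evidence from those frames.
        evidence.extend(perceive(frames, info_gap))
    return answer, confidence
```

The key design point the sketch captures is that perception is demand-driven: frames are only fetched and described when the reasoner's confidence signals a concrete information gap, rather than processing the whole video up front.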
Zixuan Dong
New York University
Reinforcement Learning, Deep Learning, Neural Collapse
Baoyun Peng
Academy of Military Science
Multimodal Understanding, Autonomous Driving, Knowledge Graph, Natural Language Processing
Yufei Wang
College of Computer, National University of Defense Technology
Lin Liu
College of Computer, National University of Defense Technology
Xinxin Dong
College of Computer, National University of Defense Technology
Yunlong Cao
College of Computer, National University of Defense Technology
Xiaodong Wang
College of Computer, National University of Defense Technology