VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

📅 2026-05-06
📈 Citations: 0
Influential: 0
📄 PDF

career value

200K/year
📝 Abstract
Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
Problem

Research questions and friction points this paper is trying to address.

Video TextVQA
keyframe anchoring
evidence localization
multimodal reasoning
visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keyframe Anchoring
Video TextVQA
Agentic Framework
Evidence Localization
Training-Free Inference
H
Haibin He
School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, Hubei, China
Maoyuan Ye
Maoyuan Ye
Wuhan University
CVOCRLLMMLLM
J
Jing Zhang
School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, Hubei, China
J
Juhua Liu
School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, Hubei, China
Bo Du
Bo Du
Department of Management, Griffith Business School
Sustainable TransportTravel BehaviourUrban Data AnalyticsLogistics and Supply Chain