Agentic Keyframe Search for Video Question Answering

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video question answering (VideoQA) suffers from redundant keyframes and high computational overhead. This paper proposes a language-agent-guided dynamic keyframe search framework that, for the first time, employs a large language model (LLM) as a reasoning agent to drive A* heuristic search over a tree-structured video segmentation hierarchy, enabling semantic-aware, adaptive keyframe selection and early termination. The method achieves frame-level sparsification: on the EgoSchema subset, it improves accuracy by 1.8% while processing only 43.5% of original frames—outperforming VideoTree with significantly reduced computation; it also achieves state-of-the-art performance on NExT-QA. The core contributions are (i) the principled migration of classical search paradigms—specifically A*—to LLM-driven semantic reasoning, and (ii) scalable semantic tree modeling of videos. This work establishes a new pathway toward efficient, interpretable VideoQA.
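The tree-structured video hierarchy the summary refers to can be sketched as a recursive bisection of the frame range. This is a minimal illustration only; the node structure, binary split rule, and `min_len` cutoff are assumptions for exposition, not the paper's actual segmentation scheme:

```python
from dataclasses import dataclass, field

@dataclass
class SegmentNode:
    """A node covering frames [start, end) of the video."""
    start: int
    end: int
    children: list = field(default_factory=list)

def build_segment_tree(start, end, min_len=4):
    """Recursively bisect a frame range into a binary segment tree.
    Segments longer than min_len frames are split in half."""
    node = SegmentNode(start, end)
    if end - start > min_len:
        mid = (start + end) // 2
        node.children = [
            build_segment_tree(start, mid, min_len),
            build_segment_tree(mid, end, min_len),
        ]
    return node

# A 32-frame video becomes a balanced tree with 4-frame leaves.
root = build_segment_tree(0, 32)
```

The search then operates over this hierarchy, expanding only the subtrees the agent deems relevant instead of sampling frames uniformly.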

📝 Abstract
Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, a critical step towards machine intelligence. However, the demand for thorough video understanding and the high computational costs still limit the widespread application of VideoQA. To address this, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It effectively distinguishes key information from redundant, irrelevant content by leveraging modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent determines whether sufficient keyframes have been collected based on termination conditions and provides answers. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe-searching efficiency, meaning it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy while processing only 43.5% of the frames compared to VideoTree. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at https://github.com/fansunqi/AKeyS.
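The search loop the abstract describes (agent-guided node expansion with early termination) can be sketched as an A*-style best-first search over video segments. The `llm_heuristic` function and the fixed `budget` termination are placeholders standing in for the language agent's relevance estimates and its termination check; they are assumptions for illustration, not the paper's method:

```python
import heapq

def llm_heuristic(segment, question):
    """Placeholder for the language agent's cost estimate.
    In AKeyS an LLM would score the segment against the question;
    here we return a dummy cost that prefers narrower segments."""
    lo, hi = segment
    return hi - lo

def agentic_keyframe_search(root_segment, question, budget=8):
    """A*-style best-first search over (lo, hi) frame segments:
    repeatedly expand the most promising segment, collect its
    midpoint as a keyframe, and stop at a fixed budget (standing
    in for the agent's termination condition)."""
    frontier = [(llm_heuristic(root_segment, question), 0, root_segment)]
    keyframes = []
    counter = 1  # tie-breaker so the heap never compares tuples
    while frontier and len(keyframes) < budget:
        _, _, (lo, hi) = heapq.heappop(frontier)
        keyframes.append((lo + hi) // 2)  # sample a keyframe here
        if hi - lo > 1:  # expand children via a binary split
            mid = (lo + hi) // 2
            for child in ((lo, mid), (mid, hi)):
                heapq.heappush(
                    frontier,
                    (llm_heuristic(child, question), counter, child),
                )
                counter += 1
    return keyframes
```

With a real LLM heuristic, the frontier would be driven toward question-relevant segments, so only a fraction of the video's frames are ever touched.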
Problem

Research questions and friction points this paper is trying to address.

Improving video question answering efficiency
Reducing computational costs in VideoQA
Enhancing keyframe identification accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Keyframe Search algorithm for VideoQA
Tree-structured video segmentation for keyframe identification
Language agent-driven heuristic and cost estimation