VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-video question answering faces core challenges, including the difficulty of associating multi-source videos and of complex spatiotemporal reasoning. To address these, we propose a person-centric hierarchical reasoning framework that constructs a multi-granularity spanning tree and a multi-agent inference mechanism, enabling cross-video spatiotemporal relationship modeling without end-to-end training. Our method integrates person re-identification (ReID), object tracking, hierarchical feature organization, and collaborative multi-agent reasoning to yield an interpretable cross-video spatiotemporal relational network. Evaluated on our newly established CrossVideoQA benchmark, our approach achieves 71.93% accuracy in person identification, 83.75% in action analysis, and 51.67% in summarization and reasoning, substantially outperforming existing methods. This work pioneers the use of persons as the central semantic anchor for cross-video alignment, establishing a scalable and interpretable paradigm for open-domain video understanding.

📝 Abstract
Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross-video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest's superior performance in cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross-video understanding by unifying multiple video streams through person-level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.
Problem

Research questions and friction points this paper is trying to address.

How to establish cross-video understanding through person-level features
How to organize multi-source video content hierarchically to support reasoning
How to achieve accurate person recognition and behavior analysis across video streams
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-anchored feature extraction using ReID and tracking
Multi-granularity spanning tree for hierarchical organization
Multi-agent framework for efficient hierarchical reasoning
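The three innovations above hinge on one data structure: a person node that anchors tracklets from multiple video streams under a single ReID identity. The paper does not publish its implementation, so the following Python sketch is purely illustrative; all class and function names (`Detection`, `Tracklet`, `PersonNode`, `build_forest`) are assumptions, and the ReID labels are taken as given input.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a person-anchored structure (names are illustrative,
# not from the paper). Leaves are per-frame detections; detections from one
# video form a tracklet; tracklets across videos hang off a person node, which
# serves as the cross-video anchor described in the Innovation list above.

@dataclass
class Detection:
    video_id: str
    frame: int
    bbox: tuple  # (x, y, w, h)

@dataclass
class Tracklet:
    video_id: str
    detections: list = field(default_factory=list)

    def span(self):
        # Temporal extent of this person's appearance within one video.
        frames = [d.frame for d in self.detections]
        return (min(frames), max(frames))

@dataclass
class PersonNode:
    person_id: str  # identity label assumed to come from a ReID model
    tracklets: list = field(default_factory=list)

    def videos(self):
        # One identity, many streams: the bridge point between videos.
        return sorted({t.video_id for t in self.tracklets})

def build_forest(detections_by_person):
    """Group ReID-labelled detections into per-person, per-video tracklets."""
    forest = {}
    for pid, dets in detections_by_person.items():
        node = PersonNode(pid)
        by_video = {}
        for d in dets:
            by_video.setdefault(d.video_id, Tracklet(d.video_id)).detections.append(d)
        node.tracklets = list(by_video.values())
        forest[pid] = node
    return forest

# Toy usage: one person seen in two cameras.
dets = {
    "p1": [Detection("cam_A", 10, (0, 0, 5, 5)),
           Detection("cam_A", 12, (1, 0, 5, 5)),
           Detection("cam_B", 90, (3, 2, 5, 5))],
}
forest = build_forest(dets)
print(forest["p1"].videos())  # → ['cam_A', 'cam_B']
```

In a full system, a multi-agent reasoner would traverse such nodes top-down (person, then video, then tracklet, then frame) to answer queries like "where did this person go next?", which is the kind of cross-video question the framework targets.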
👥 Authors
Yiran Meng (Sun Yat-Sen University, Zhuhai, China)
Junhong Ye (Sun Yat-Sen University, Zhuhai, China)
Wei Zhou (Cardiff University, United Kingdom)
Guanghui Yue (Shenzhen University, Shenzhen, China)
Xudong Mao (Sun Yat-sen University)
Ruomei Wang (Sun Yat-Sen University, Guangzhou, China)
Baoquan Zhao (Sun Yat-sen University)