🤖 AI Summary
Cross-video question answering faces two core challenges: associating content across multiple source videos and performing complex spatiotemporal reasoning over them. To address these, we propose a person-centric hierarchical reasoning framework that constructs a multi-granularity spanning tree and a multi-agent inference mechanism, enabling cross-video spatiotemporal relationship modeling without end-to-end training. Our method integrates person re-identification (ReID), object tracking, hierarchical feature organization, and collaborative multi-agent reasoning to yield an interpretable cross-video spatiotemporal relational network. Evaluated on our newly established benchmark CrossVideoQA, our approach achieves 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, substantially outperforming existing methods. This work pioneers the use of persons as the central semantic anchor for cross-video alignment, establishing a scalable and interpretable paradigm for open-domain video understanding.
📝 Abstract
Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs person re-identification (ReID) and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross-video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest's superior performance on cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross-video understanding by unifying multiple video streams through person-level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.
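To make the spanning-tree idea concrete, the following is a minimal sketch of how person-level trajectories might be organized into a multi-granularity tree that supports cross-video queries. All names here (`TrackSegment`, `TreeNode`, `build_person_tree`, `videos_containing`) are illustrative assumptions, not VideoForest's actual API; the ReID and tracking stages are abstracted away as precomputed `person_id` labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical data structures for a person-anchored, multi-granularity tree:
# root -> person -> video -> temporal segment. Segment captions stand in for
# whatever visual/textual features the real system would attach at the leaves.

@dataclass
class TrackSegment:
    person_id: str   # ReID-assigned identity, shared across videos
    video_id: str
    t_start: float   # segment start time (seconds)
    t_end: float
    caption: str     # e.g. an action description for this segment

@dataclass
class TreeNode:
    level: str                       # "root" | "person" | "video" | "segment"
    label: str
    children: List["TreeNode"] = field(default_factory=list)
    payload: Optional[TrackSegment] = None

def build_person_tree(segments: List[TrackSegment]) -> TreeNode:
    """Group track segments by person identity, then by source video."""
    root = TreeNode("root", "all")
    by_person: dict = {}
    for seg in segments:
        by_person.setdefault(seg.person_id, {}).setdefault(seg.video_id, []).append(seg)
    for pid, videos in sorted(by_person.items()):
        pnode = TreeNode("person", pid)
        for vid, segs in sorted(videos.items()):
            vnode = TreeNode("video", vid)
            for seg in sorted(segs, key=lambda s: s.t_start):
                vnode.children.append(TreeNode("segment", seg.caption, payload=seg))
            pnode.children.append(vnode)
        root.children.append(pnode)
    return root

def videos_containing(root: TreeNode, person_id: str) -> List[str]:
    """A simple cross-video query: which videos does this person appear in?"""
    for pnode in root.children:
        if pnode.label == person_id:
            return [vnode.label for vnode in pnode.children]
    return []

segments = [
    TrackSegment("p1", "camA", 0.0, 5.0, "enters lobby"),
    TrackSegment("p1", "camB", 10.0, 15.0, "sits at desk"),
    TrackSegment("p2", "camA", 2.0, 8.0, "walks past counter"),
]
root = build_person_tree(segments)
print(videos_containing(root, "p1"))  # ['camA', 'camB']
```

Because the tree is keyed by person identity first, a reasoning agent can answer "where did person X go?" by descending one person branch and reading off video nodes in temporal order, rather than scanning every video stream.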