🤖 AI Summary
Existing video scene graph generation methods struggle to jointly model fine-grained spatial relationships and long-range temporal dependencies, resulting in fragmented scene representations. To address dynamic scene understanding in aerial videos, we propose THYME, a hierarchical recurrent interaction modeling framework: (1) multi-scale spatial feature aggregation enhances fine-grained object localization; (2) a recurrent temporal refinement mechanism captures long-range temporal dependencies; and (3) joint motion-appearance feature learning enables dynamic relation reasoning from aerial viewpoints. To support this work, we introduce AeroEye-v1.0, the first fine-grained video scene graph dataset tailored to aerial scenarios, covering five typical interaction categories. Extensive experiments on ASPIRe and AeroEye-v1.0 demonstrate that our method significantly improves the spatiotemporal consistency and semantic accuracy of generated scene graphs while generalizing well to both ground-level and aerial scenes.
📝 Abstract
The rapid proliferation of video in applications such as autonomous driving, surveillance, and sports analytics necessitates robust methods for dynamic scene understanding. Despite advances in static scene graph generation and early attempts at video scene graph generation, previous methods often produce fragmented representations, failing to capture fine-grained spatial details and long-range temporal dependencies simultaneously. To address these limitations, we introduce the Temporal Hierarchical Cyclic Scene Graph (THYME) approach, which integrates hierarchical feature aggregation with cyclic temporal refinement. In particular, THYME effectively models multi-scale spatial context and enforces temporal consistency across frames, yielding more accurate and coherent scene graphs. In addition, we present AeroEye-v1.0, a novel aerial video dataset annotated with five types of interactivity, which overcomes the constraints of existing datasets and provides a comprehensive benchmark for dynamic scene graph generation. Extensive experiments on ASPIRe and AeroEye-v1.0 demonstrate that THYME outperforms state-of-the-art methods, improving scene understanding in both ground-view and aerial scenarios.
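To make the two core ideas concrete, the sketch below illustrates (a) multi-scale spatial aggregation, pooling a per-frame feature map over grids at several scales, and (b) recurrent temporal refinement, carrying a hidden state across frames so each frame's representation is conditioned on earlier ones. This is a minimal NumPy toy, not the paper's actual THYME implementation: the function names, the single-layer `tanh` recurrent cell, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def multiscale_aggregate(feat, scales=(1, 2, 4)):
    """Average-pool a C x H x W feature map over s x s grids for each
    scale s, then concatenate the pooled cell vectors (toy stand-in for
    hierarchical/multi-scale spatial aggregation)."""
    c, h, w = feat.shape
    pooled = []
    for s in scales:
        for i in range(s):
            for j in range(s):
                patch = feat[:, i * h // s:(i + 1) * h // s,
                             j * w // s:(j + 1) * w // s]
                pooled.append(patch.mean(axis=(1, 2)))
    return np.concatenate(pooled)  # length C * (1 + 4 + 16) for these scales

def recurrent_refine(frame_feats, hidden_dim=32):
    """Toy recurrent refinement: a hidden state h is updated from the
    previous h and the current frame feature, enforcing temporal
    continuity across the per-frame embeddings."""
    in_dim = frame_feats[0].size
    W = rng.standard_normal((hidden_dim, hidden_dim + in_dim)) * 0.05
    h = np.zeros(hidden_dim)
    refined = []
    for f in frame_feats:
        h = np.tanh(W @ np.concatenate([h, f]))
        refined.append(h.copy())
    return refined

# 5 frames of C=8, 16x16 feature maps (random stand-ins for backbone output)
frames = [rng.standard_normal((8, 16, 16)) for _ in range(5)]
agg = [multiscale_aggregate(f) for f in frames]
out = recurrent_refine(agg)
print(len(out), out[0].shape)  # one refined embedding per frame
```

In a full pipeline these refined per-frame embeddings would feed a relation head that scores subject-predicate-object triplets; here they simply show how spatial pooling and the recurrent update compose.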