THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage

📅 2025-07-12
🤖 AI Summary
Existing video scene graph generation methods struggle to jointly model fine-grained spatial relationships and long-range temporal dependencies, resulting in fragmented scene representations. To address dynamic scene understanding in aerial videos, we propose THYME, a hierarchical-cyclic interaction modeling framework: (1) multi-scale spatial feature aggregation enhances fine-grained object localization; (2) a cyclic temporal refinement mechanism captures long-range temporal dependencies; and (3) joint motion-appearance feature learning enables dynamic relation reasoning under aerial viewpoints. To support this work, we introduce AeroEye-v1.0, a fine-grained video scene graph dataset tailored for aerial scenarios that covers five types of interaction. Extensive experiments on ASPIRe and AeroEye-v1.0 show that our method improves the spatiotemporal consistency and semantic accuracy of generated scene graphs and generalizes across both ground-level and aerial scenes.

📝 Abstract
The rapid proliferation of video in applications such as autonomous driving, surveillance, and sports analytics necessitates robust methods for dynamic scene understanding. Despite advances in static scene graph generation and early attempts at video scene graph generation, previous methods often produce fragmented representations because they fail to capture fine-grained spatial details and long-range temporal dependencies simultaneously. To address these limitations, we introduce the Temporal Hierarchical Cyclic Scene Graph (THYME) approach, which integrates hierarchical feature aggregation with cyclic temporal refinement. In particular, THYME models multi-scale spatial context and enforces temporal consistency across frames, yielding more accurate and coherent scene graphs. In addition, we present AeroEye-v1.0, a novel aerial video dataset enriched with five types of interactivity, which overcomes the constraints of existing datasets and provides a comprehensive benchmark for dynamic scene graph generation. Extensive experiments on ASPIRe and AeroEye-v1.0 demonstrate that THYME outperforms state-of-the-art methods, offering improved scene understanding in both ground-view and aerial scenarios.

Problem

Research questions and friction points this paper is trying to address.

Fragmented scene graph representations in videos
Lack of fine-grained spatial and temporal modeling
Limited datasets for aerial dynamic scene analysis
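
To make "fragmented" concrete, the sketch below represents a video scene graph as one set of (subject, predicate, object) triplets per frame and scores how stable those sets are over time. The `Relation` class, the `temporal_consistency` function, the object names, and the Jaccard-overlap score are all illustrative assumptions; they are not the paper's data format or evaluation metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One edge of a scene graph: subject --predicate--> object (hypothetical format)."""
    subject: str
    predicate: str
    obj: str


def temporal_consistency(frames):
    """Mean Jaccard overlap between the relation sets of consecutive frames.

    Returns 1.0 when every frame repeats the same relations and 0.0 when
    consecutive frames share none. This is an assumed toy score, not a
    metric taken from the paper.
    """
    if len(frames) < 2:
        return 1.0
    scores = []
    for prev, curr in zip(frames, frames[1:]):
        union = prev | curr
        scores.append(len(prev & curr) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# A coherent prediction keeps the same relation across three frames.
coherent = [
    {Relation("car_1", "following", "truck_2")},
    {Relation("car_1", "following", "truck_2")},
    {Relation("car_1", "following", "truck_2")},
]

# A fragmented prediction flips the predicate in the middle frame.
fragmented = [
    {Relation("car_1", "following", "truck_2")},
    {Relation("car_1", "next_to", "truck_2")},
    {Relation("car_1", "following", "truck_2")},
]

print(temporal_consistency(coherent))
print(temporal_consistency(fragmented))
```

A per-frame predictor with no temporal modeling can produce the second pattern even when each frame looks plausible in isolation, which is the kind of inconsistency the listed friction points describe.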

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical feature aggregation for multi-scale context
Cyclic temporal refinement for consistency
AeroEye-v1.0 dataset for comprehensive benchmarking
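
The first two ideas above can be illustrated with a toy sketch. It assumes that "hierarchical aggregation" means pooling the same signal at several window sizes and concatenating the results, and that "cyclic refinement" means blending each frame's score with its neighbours while treating the frame sequence as a ring. Both interpretations, along with the function names `pool_1d`, `hierarchical_aggregate`, and `cyclic_refine` and the parameters `windows`, `steps`, and `alpha`, are assumptions made for illustration; the summary does not specify THYME's actual architecture.

```python
def pool_1d(values, window):
    """Average non-overlapping windows, giving a coarser view of the signal."""
    pooled = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


def hierarchical_aggregate(values, windows=(1, 2, 4)):
    """Concatenate the signal pooled at several scales (fine to coarse)."""
    features = []
    for window in windows:
        features.extend(pool_1d(values, window))
    return features


def cyclic_refine(frame_scores, steps=1, alpha=0.5):
    """Blend each frame's score with its two temporal neighbours.

    Indices wrap around (frame 0 neighbours the last frame), so the
    sequence is treated as a ring. alpha controls how much weight the
    neighbours receive; steps is the number of refinement passes.
    """
    n = len(frame_scores)
    scores = list(frame_scores)
    for _ in range(steps):
        scores = [
            (1 - alpha) * scores[t]
            + alpha * 0.5 * (scores[(t - 1) % n] + scores[(t + 1) % n])
            for t in range(n)
        ]
    return scores


# Multi-scale view of a four-element feature vector.
multi_scale = hierarchical_aggregate([1.0, 3.0, 5.0, 7.0])

# Per-frame confidence for one relation; frame 1 is an outlier.
refined = cyclic_refine([1.0, 0.0, 1.0, 1.0])

print(multi_scale)
print(refined)
```

In this toy setting, the refinement pass raises the outlier frame's score toward its neighbours while leaving the total score unchanged, which is one simple way to encourage consistency across frames.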