Understanding Dynamic Scenes in Egocentric 4D Point Clouds

📅 2025-08-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current egocentric 4D point cloud understanding for dynamic scenes is hindered by the absence of unified 4D annotations and task-driven evaluation. To address this, we introduce EgoDynamic4D—the first multi-task 4D visual question answering benchmark for dynamic scenes—featuring RGB-D videos, instance masks, and spatiotemporally consistent 4D bounding boxes. It incorporates explicit chain-of-thought annotations and a multidimensional evaluation framework to support fine-grained spatiotemporal reasoning about object motion, human–object interaction, and causal temporal relationships. Methodologically, we propose an instance-aware feature encoder, joint temporal-pose encoding, and spatially adaptive downsampling to compress raw 4D point clouds into compact, LLM-compatible temporal sequences, enabling end-to-end spatiotemporal reasoning. Extensive experiments on EgoDynamic4D demonstrate significant improvements over strong baselines, validating both the effectiveness and robustness of our framework.
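The summary above describes the token-compression pipeline only at a high level. Below is a minimal, illustrative sketch of how instance-aware feature encoding, joint temporal-pose encoding, and spatially adaptive downsampling could be combined into an LLM-compatible token sequence. All module names, tensor shapes, and the norm-based downsampling heuristic are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Dynamic4DTokenizer(nn.Module):
    """Illustrative sketch (not the paper's code): compress per-frame 4D point
    features into a compact token sequence that an LLM can consume."""

    def __init__(self, feat_dim=256, llm_dim=4096, tokens_per_frame=32):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        # Instance-aware encoding: fuse point features with an instance embedding
        # (hypothetical instance-id vocabulary size of 1024).
        self.instance_embed = nn.Embedding(1024, feat_dim)
        # Joint temporal-pose encoding: timestamp (1) + camera pose (12 = 3x4 extrinsics).
        self.time_pose_mlp = nn.Sequential(
            nn.Linear(13, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        # Projection into the LLM embedding space.
        self.to_llm = nn.Linear(feat_dim, llm_dim)

    def forward(self, point_feats, instance_ids, timestamps, poses):
        # point_feats: (T, N, feat_dim), instance_ids: (T, N) long,
        # timestamps: (T,) float, poses: (T, 3, 4) camera extrinsics per frame.
        T, N, C = point_feats.shape
        feats = point_feats + self.instance_embed(instance_ids)              # instance-aware features
        tp = torch.cat([timestamps[:, None], poses.reshape(T, 12)], dim=-1)  # (T, 13)
        feats = feats + self.time_pose_mlp(tp)[:, None, :]                   # broadcast over points

        # Spatially adaptive downsampling (stand-in heuristic): keep the
        # highest-norm features per frame as a proxy for informative regions.
        scores = feats.norm(dim=-1)                                          # (T, N)
        idx = scores.topk(self.tokens_per_frame, dim=1).indices              # (T, K)
        kept = torch.gather(feats, 1, idx[..., None].expand(-1, -1, C))      # (T, K, C)

        return self.to_llm(kept).flatten(0, 1)                               # (T*K, llm_dim) tokens
```

In this sketch the compression budget is fixed at `tokens_per_frame` tokens per frame; the paper's spatially adaptive downsampling presumably allocates tokens by scene content rather than by a simple feature-norm ranking.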

📝 Abstract
Understanding dynamic 4D scenes from an egocentric perspective, that is, modeling changes in 3D spatial structure over time, is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially regarding the motion of objects and humans, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.
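As a concrete illustration of what one benchmark entry might look like, the snippet below sketches a hypothetical QA record with chain-of-thought steps and a 4D bounding box annotation. Every field name and value here is invented for illustration and does not reflect the released data format.

```python
# Hypothetical example of a single QA record; the schema is illustrative,
# not the benchmark's actual file format.
qa_record = {
    "task": "trajectory_prediction",            # one of the 12 dynamic QA task types
    "question": "Where will the cup the person is holding be in 2 seconds?",
    "chain_of_thought": [
        "Locate the cup instance across the observed frames.",
        "Estimate its velocity from the recent 4D bounding boxes.",
        "Extrapolate the position 2 seconds ahead.",
    ],
    "answer": [0.42, -0.13, 0.97],               # predicted 3D position in world coordinates
    "evidence": {
        "instance_id": 17,                       # globally unique instance mask id
        "bbox_4d": {                             # spatiotemporally consistent box: time, center, size, yaw
            "t": 12.4,
            "center": [0.38, -0.10, 0.95],
            "size": [0.08, 0.08, 0.12],
            "yaw": 0.0,
        },
    },
}
```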
Problem

Research questions and friction points this paper is trying to address.

Lack of unified 4D annotations for dynamic scenes
Missing task-driven evaluation protocols for spatio-temporal reasoning
Need to model human-object interactions in 4D
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified 4D annotations for dynamic scenes
End-to-end spatio-temporal reasoning framework
Instance-aware feature encoding for LLMs