DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing open-set 4D detection methods typically rely on frame-wise prediction or complex multi-stage pipelines, leading to temporal inconsistency and error accumulation; moreover, progress has long been hindered by the scarcity of large-scale sequential RGB-D video data with continuous 3D bounding box annotations. To address these limitations, we introduce DA4D—the first large-scale RGB-D video dataset explicitly designed for open-set 4D detection—with dense, temporally consistent 3D box annotations. We further propose an end-to-end geometric-aware spatiotemporal decoder that fuses pretrained multimodal features and employs multi-task learning alongside sequence-adaptive training to achieve global temporal consistency. Our approach significantly mitigates bounding box jitter and trajectory fragmentation. On DA4D, it achieves state-of-the-art detection accuracy and superior temporal stability, establishing a new paradigm for efficient and robust 4D detection.

Technology Category

Application Category

📝 Abstract
Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Achieving reliable 4D object detection in streaming RGB videos
Addressing temporal inconsistency and jitter in 3D bounding box predictions
Overcoming limitations of complex multi-stage pipelines and error propagation
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end framework predicts 3D bounding boxes from sequences
Fuses multi-modal features with geometry-aware spatiotemporal decoder
Multi-task learning maintains global consistency across varying sequence lengths
🔎 Similar Papers
No similar papers found.