🤖 AI Summary
Existing open-set 4D detection methods typically rely on frame-wise prediction or complex multi-stage pipelines, which leads to temporal inconsistency and error accumulation. Progress has also long been hindered by the scarcity of large-scale sequential RGB-D video data with continuous 3D bounding box annotations. To address these limitations, we introduce DA4D, the first large-scale RGB-D video dataset explicitly designed for open-set 4D detection, with dense, temporally consistent 3D box annotations. We further propose DetAny4D, an end-to-end framework built around a geometry-aware spatiotemporal decoder that fuses pretrained multimodal features and employs multi-task learning alongside sequence-adaptive training to achieve global temporal consistency. The approach significantly reduces bounding box jitter and trajectory fragmentation. On DA4D, it achieves competitive detection accuracy and markedly better temporal stability, establishing a new paradigm for efficient and robust 4D detection.
📝 Abstract
Reliable 4D object detection, i.e., 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions frame by frame without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that provide continuous, reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundation models and uses a geometry-aware spatiotemporal decoder to capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.
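
To make the notion of b-box jitter concrete, the sketch below shows one simple way it could be quantified for a single tracked object: the mean magnitude of the second temporal difference of the box centers. This is an illustrative assumption, not the stability metric used in the paper; the function name `center_jitter` and the input layout (one `(x, y, z)` center per frame) are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


def center_jitter(centers: Sequence[Point3]) -> float:
    """Mean magnitude of the second temporal difference of 3D box centers.

    Hypothetical illustration only (not the paper's metric).
    `centers` holds one (x, y, z) box center per frame for one tracked object.
    A track moving at constant velocity scores 0.0; frame-to-frame wobble
    raises the score.
    """
    if len(centers) < 3:
        # A second difference needs at least three frames.
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(centers, centers[1:], centers[2:]):
        # Discrete acceleration per axis: nxt - 2 * cur + prev
        total += sqrt(sum((n - 2 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt)))
    return total / (len(centers) - 2)


if __name__ == "__main__":
    smooth = [(0.5 * t, 0.0, 0.0) for t in range(5)]  # constant velocity
    wobbly = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(center_jitter(smooth))
    print(center_jitter(wobbly))
```

Under this measure, frame-by-frame predictors that wobble around a smooth trajectory score higher than temporally consistent ones, even when their per-frame accuracy is similar. A fuller measure would also account for box size and orientation.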