🤖 AI Summary
Existing 3D scene understanding methods rely on offline multi-view data or pre-built geometry, limiting their applicability to online, dynamic 3D object extraction and semantic interpretation in open-world streaming video. This paper introduces the first unified online framework that jointly optimizes 3D reconstruction and semantic understanding from streaming video. It integrates vision-language encoders and 2D visual foundation models to parse frame-level content and incrementally construct and update a Gaussian feature map. A recurrent joint optimization module enhances cross-modal attention, while visual odometry combined with a feed-forward update strategy ensures real-time performance. Extensive experiments show significant improvements over state-of-the-art methods across photorealistic rendering, semantic and instance segmentation, 3D detection, and occupancy prediction, validating the framework's effectiveness, real-time capability, and generalizability.
📝 Abstract
Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
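The per-frame loop described above (encode the incoming frame, estimate odometry from history, feed-forward the new observations into the Gaussian feature map, then jointly refine) can be sketched in miniature. This is an illustrative toy, not the authors' implementation: `GaussianFeatureMap`, `encode_frame`, and `estimate_odometry` are hypothetical stand-ins for EA3D's actual map, foundation-model encoders, and visual odometry.

```python
import numpy as np

class GaussianFeatureMap:
    """Toy map: each entry holds a 3D Gaussian mean and a semantic feature vector."""
    def __init__(self, feat_dim=8):
        self.means = np.empty((0, 3))
        self.feats = np.empty((0, feat_dim))

    def feedforward_update(self, new_means, new_feats):
        # Feed-forward insertion of newly observed Gaussians (no re-optimization).
        self.means = np.vstack([self.means, new_means])
        self.feats = np.vstack([self.feats, new_feats])

    def joint_refine(self, lr=0.1):
        # Stand-in for the recurrent joint optimization module: pull features
        # toward their running mean to mimic cross-frame consistency.
        if len(self.feats):
            self.feats += lr * (self.feats.mean(axis=0) - self.feats)

def encode_frame(frame_id, feat_dim=8):
    # Stand-in for the vision-language / 2D foundation encoders: pretend each
    # frame yields a few camera-frame 3D points with object-level features.
    rng = np.random.default_rng(frame_id)
    return rng.normal(size=(4, 3)), rng.normal(size=(4, feat_dim))

def estimate_odometry(history):
    # Stand-in for visual odometry estimated from historical frames;
    # here the camera never moves (identity pose).
    return np.eye(4)

def run_stream(num_frames=5):
    gmap, history = GaussianFeatureMap(), []
    for frame_id in range(num_frames):
        means, feats = encode_frame(frame_id)            # per-frame parsing
        pose = estimate_odometry(history)                # odometry from history
        means_w = means @ pose[:3, :3].T + pose[:3, 3]   # lift to world frame
        gmap.feedforward_update(means_w, feats)          # incremental map update
        gmap.joint_refine()                              # recurrent refinement
        history.append(frame_id)
    return gmap
```

In the real system, `joint_refine` would optimize rendering and semantic losses over regions of interest, and `estimate_odometry` would track camera pose; the skeleton only shows how the online, incremental control flow fits together.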