🤖 AI Summary
Existing 3D scene understanding methods rely on offline multi-view data or pre-built geometry, limiting their applicability to online, dynamic 3D object extraction and semantic interpretation in open-world streaming video. This paper introduces the first unified online framework that jointly optimizes 3D reconstruction and semantic understanding from streaming video. It integrates vision-language encoders and 2D visual foundation models to parse frame-level content and incrementally construct and update a Gaussian feature map. A recurrent joint optimization module enhances cross-modal attention, while visual odometry combined with a feed-forward update strategy ensures real-time performance. Extensive experiments show significant improvements over state-of-the-art methods across photorealistic rendering, semantic and instance segmentation, 3D detection, and occupancy prediction, validating the framework's effectiveness, real-time capability, and generalizability.
📝 Abstract
Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
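The per-frame loop described above (encode the incoming frame, estimate odometry from history, feed-forward the new observations into the Gaussian feature map, then jointly refine) can be sketched in miniature. This is an illustrative toy, not the authors' implementation: `GaussianFeatureMap`, `encode_frame`, and `estimate_odometry` are hypothetical stand-ins for EA3D's actual map, foundation-model encoders, and visual odometry.

```python
import numpy as np

class GaussianFeatureMap:
    """Toy map: each entry holds a 3D Gaussian mean and a semantic feature vector."""
    def __init__(self, feat_dim=8):
        self.means = np.empty((0, 3))
        self.feats = np.empty((0, feat_dim))

    def feedforward_update(self, new_means, new_feats):
        # Feed-forward insertion of newly observed Gaussians (no re-optimization).
        self.means = np.vstack([self.means, new_means])
        self.feats = np.vstack([self.feats, new_feats])

    def joint_refine(self, lr=0.1):
        # Stand-in for the recurrent joint optimization module: pull features
        # toward their running mean to mimic cross-frame consistency.
        if len(self.feats):
            self.feats += lr * (self.feats.mean(axis=0) - self.feats)

def encode_frame(frame_id, feat_dim=8):
    # Stand-in for the vision-language / 2D foundation encoders: pretend each
    # frame yields a few camera-frame 3D points with object-level features.
    rng = np.random.default_rng(frame_id)
    return rng.normal(size=(4, 3)), rng.normal(size=(4, feat_dim))

def estimate_odometry(history):
    # Stand-in for visual odometry estimated from historical frames;
    # here the camera never moves (identity pose).
    return np.eye(4)

def run_stream(num_frames=5):
    gmap, history = GaussianFeatureMap(), []
    for frame_id in range(num_frames):
        means, feats = encode_frame(frame_id)            # per-frame parsing
        pose = estimate_odometry(history)                # odometry from history
        means_w = means @ pose[:3, :3].T + pose[:3, 3]   # lift to world frame
        gmap.feedforward_update(means_w, feats)          # incremental map update
        gmap.joint_refine()                              # recurrent refinement
        history.append(frame_id)
    return gmap
```

In the real system, `joint_refine` would optimize rendering and semantic losses over regions of interest, and `estimate_odometry` would track camera pose; the skeleton only shows how the online, incremental control flow fits together.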