EA3D: Online Open-World 3D Object Extraction from Streaming Videos

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D scene understanding methods rely on offline multi-view data or pre-built geometry, limiting their applicability to online, dynamic 3D object extraction and semantic interpretation in open-world streaming video. This paper introduces the first unified online framework that jointly optimizes 3D reconstruction and semantic understanding from streaming video. It integrates vision-language encoders and 2D visual foundation models to parse frame-level content and incrementally construct and update a Gaussian feature map. A recurrent joint optimization module enhances cross-modal attention, while visual odometry combined with feedforward update strategies ensures real-time performance. Extensive experiments demonstrate significant improvements over state-of-the-art methods across photorealistic rendering, semantic/instance segmentation, 3D detection, and occupancy prediction—validating the framework’s effectiveness, real-time capability, and generalizability.

📝 Abstract
Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Online 3D object extraction from streaming video data
Simultaneous geometric reconstruction and holistic scene understanding
Unified framework for joint 3D reconstruction and semantic interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online framework for open-world 3D object extraction
Integrates vision-language knowledge into Gaussian feature maps
Joint optimization enhances reconstruction and semantic understanding
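The feed-forward online update described above (incrementally fusing each frame's object-level features into a persistent Gaussian feature map) can be sketched roughly as follows. This is a minimal illustrative assumption, not the paper's actual Gaussian-splatting pipeline: the class name, the matching interface, and the exponential-moving-average fusion rule are all hypothetical stand-ins.

```python
import numpy as np


class OnlineGaussianFeatureMap:
    """Toy sketch of an online, feed-forward feature-map update.

    Assumed interface: each incoming frame yields candidate 3D points and
    per-point semantic features; points matched to existing map Gaussians
    are fused, unmatched ones are appended as new Gaussians.
    """

    def __init__(self, feat_dim: int, momentum: float = 0.9):
        self.positions = np.zeros((0, 3))        # 3D Gaussian centers
        self.features = np.zeros((0, feat_dim))  # per-Gaussian semantic features
        self.momentum = momentum

    def integrate_frame(self, new_positions, new_features, matches):
        """Fuse one frame's observations into the map.

        matches: list of (map_idx, obs_idx) pairs for observations that
        correspond to Gaussians already in the map.  Returns the number
        of newly added Gaussians.
        """
        matched_obs = set()
        for map_idx, obs_idx in matches:
            # Feed-forward update: blend old and new features (EMA),
            # avoiding any per-frame gradient-based optimization.
            self.features[map_idx] = (
                self.momentum * self.features[map_idx]
                + (1.0 - self.momentum) * new_features[obs_idx]
            )
            matched_obs.add(obs_idx)
        # Unmatched observations become new Gaussians.
        fresh = [i for i in range(len(new_positions)) if i not in matched_obs]
        self.positions = np.vstack([self.positions, new_positions[fresh]])
        self.features = np.vstack([self.features, new_features[fresh]])
        return len(fresh)
```

In this sketch, camera pose (visual odometry) is assumed to have already been applied to `new_positions`; the recurrent joint optimization step of the paper has no counterpart here.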
Xiaoyu Zhou
Peking University
Computer Vision · Autonomous Driving · AI Security
Jingqi Wang
Wangxuan Institute of Computer Technology, Peking University
Yuang Jia
Wangxuan Institute of Computer Technology, Peking University
Yongtao Wang
Wangxuan Institute of Computer Technology, Peking University
Deqing Sun
Research Scientist, Google DeepMind
Computer Vision · Optical Flow · Machine Learning · Image Processing
Ming-Hsuan Yang
University of California at Merced; Google DeepMind
Computer Vision · Machine Learning · Artificial Intelligence