AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing active 3D reconstruction methods rely on hand-crafted, geometry-driven heuristics, leading to redundant observations and limited gains in reconstruction quality. To address this, we propose the first end-to-end vision-language-guided active reconstruction framework. First, we introduce a decoupled, feed-forward view-uncertainty modeling mechanism that accurately estimates geometric confidence for unobserved regions without online optimization. Second, we integrate a vision-language model (VLM) to provide high-level semantic guidance, enabling diverse and information-rich viewpoint selection beyond purely geometric cues. Third, we unify 3D perception and active viewpoint planning in a single feed-forward architecture. Evaluated on scene-level (ScanNet) and object-level (Objaverse) benchmarks under sparse-view settings, our method significantly outperforms state-of-the-art approaches, improving reconstruction completeness and accuracy simultaneously and demonstrating the benefit of combining semantic guidance with uncertainty modeling.
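The summary suggests a simple selection rule: score each candidate viewpoint by its estimated geometric uncertainty plus a VLM-derived semantic term, then move to the best one. Below is a minimal sketch of that loop, assuming hypothetical callables `reconstruct_feed_forward`, `estimate_view_uncertainty`, and `vlm_semantic_score` and a weight `lambda_sem`; none of these names, nor the linear combination, come from the paper.

```python
# Hedged sketch of an uncertainty + semantics view-selection loop.
# All function names below are illustrative placeholders, not AREA3D's API.
import numpy as np

def select_next_view(images, poses, candidate_poses,
                     reconstruct_feed_forward,   # (images, poses) -> scene
                     estimate_view_uncertainty,  # (scene, pose) -> float
                     vlm_semantic_score,         # (scene, pose) -> float
                     lambda_sem=0.5):
    """Pick the candidate pose maximizing a combined geometric-uncertainty
    and semantic-informativeness score."""
    # One feed-forward pass over all views gathered so far; no per-step
    # optimization, matching the paper's claim of avoiding online optimization.
    scene = reconstruct_feed_forward(images, poses)

    scores = []
    for pose in candidate_poses:
        u = estimate_view_uncertainty(scene, pose)  # geometric confidence gap
        s = vlm_semantic_score(scene, pose)         # high-level VLM guidance
        scores.append(u + lambda_sem * s)
    return candidate_poses[int(np.argmax(scores))]
```

Because the reconstructor is feed-forward, each planning step costs a single forward pass rather than a per-step optimization, which is the efficiency argument the summary makes.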

📝 Abstract
Active 3D reconstruction enables an agent to autonomously select viewpoints to efficiently obtain accurate and complete scene geometry, rather than passively reconstructing scenes from pre-collected images. However, existing active reconstruction methods often rely on hand-crafted geometric heuristics, which can lead to redundant observations without substantially improving reconstruction quality. To address this limitation, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D reconstruction models and vision-language guidance. Our framework decouples view-uncertainty modeling from the underlying feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. In addition, an integrated vision-language model provides high-level semantic guidance, encouraging informative and diverse viewpoints beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime. Code will be made available at: https://github.com/TianlingXu/AREA3D.
Problem

Research questions and friction points this paper is trying to address.

Active 3D reconstruction requires an agent to autonomously select viewpoints that efficiently capture accurate, complete scene geometry.
Existing methods rely on hand-crafted geometric heuristics, causing redundant observations with little gain in reconstruction quality.
AREA3D addresses this by unifying feed-forward 3D perception with vision-language guidance in a single active reconstruction agent.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward 3D perception for active reconstruction
Vision-language guidance for semantic viewpoint selection
Decoupled uncertainty modeling without online optimization (a simple geometric proxy is sketched below)
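The decoupled uncertainty head itself is not published. As a rough, self-contained proxy for its geometric side, the sketch below projects the current reconstruction (taken here to be a point cloud) into a candidate pinhole camera and scores the view by the fraction of pixels left uncovered; `coverage_uncertainty`, `K`, and `T_cw` are illustrative assumptions, not AREA3D's interface, and this coverage heuristic stands in for the paper's learned estimator.

```python
# Coverage-based stand-in for a geometric view-uncertainty score.
import numpy as np

def coverage_uncertainty(points_w, K, T_cw, hw=(240, 320)):
    """points_w: (N, 3) world-frame points; K: (3, 3) pinhole intrinsics;
    T_cw: (4, 4) world-to-camera extrinsics. Returns the fraction of the
    candidate image plane with no projected geometry (higher = more unseen)."""
    h, w = hw
    # World -> camera coordinates, then keep points in front of the camera.
    p = (T_cw[:3, :3] @ points_w.T + T_cw[:3, 3:4]).T
    p = p[p[:, 2] > 1e-6]
    # Perspective projection to pixel coordinates.
    uv = (K @ (p / p[:, 2:3]).T).T[:, :2]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    mask = np.zeros((h, w), dtype=bool)
    mask[v[ok], u[ok]] = True          # pixels covered by existing geometry
    return 1.0 - mask.mean()

# Example: a random point cloud 3 m in front of a camera at the origin.
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
pts = rng.random((5000, 3)) * 2 - 1 + np.array([0.0, 0.0, 3.0])
print(coverage_uncertainty(pts, K, np.eye(4)))
```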
Tianling Xu
Southern University of Science and Technology
Shengzhe Gan
Southern University of Science and Technology
Leslie Gu
Harvard University
Yuelei Li
California Institute of Technology
Fangneng Zhan
MIT
Neural Rendering, Generative Models
Hanspeter Pfister
An Wang Professor of Computer Science, Harvard University
Visualization, Computer Graphics, Computer Vision