🤖 AI Summary
Existing active 3D reconstruction methods rely on hand-crafted, geometry-driven heuristics, leading to redundant observations and limited gains in reconstruction quality. To address this, we propose the first end-to-end vision-language-guided active reconstruction framework. First, we introduce a decoupled feed-forward view-uncertainty modeling mechanism that accurately estimates geometric confidence for unobserved regions without online optimization. Second, we integrate a vision-language model (VLM) to provide high-level semantic guidance, enabling diverse and information-rich viewpoint selection beyond purely geometric cues. Third, we unify 3D perception and active viewpoint planning in a single feed-forward architecture. Evaluated on scene-level (ScanNet) and object-level (Objaverse) benchmarks under sparse-view settings, our method significantly outperforms state-of-the-art approaches, improving both reconstruction completeness and accuracy and demonstrating the effectiveness of combining semantic guidance with uncertainty modeling.
📝 Abstract
Active 3D reconstruction enables an agent to autonomously select viewpoints in order to efficiently obtain accurate and complete scene geometry, rather than passively reconstructing scenes from pre-collected images. However, existing active reconstruction methods often rely on hand-crafted geometric heuristics, which can lead to redundant observations without substantially improving reconstruction quality. To address this limitation, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D reconstruction models and vision-language guidance. Our framework decouples view-uncertainty modeling from the underlying feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. In addition, an integrated vision-language model provides high-level semantic guidance, encouraging informative and diverse viewpoints beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime. Code will be made available at: https://github.com/TianlingXu/AREA3D.