🤖 AI Summary
This work addresses the limitations of existing object-centric 3D reconstruction methods, which rely on multi-stage pipelines, are sensitive to segmentation errors, and scale poorly with scene complexity. The authors propose OCH3R, an end-to-end unified framework that, from a single RGB image, predicts the 6D poses and detailed 3D reconstructions of all object instances in one forward pass, with inference cost independent of the number of objects. Built on a Transformer architecture, the method predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, and normalized object coordinates (NOCS), together with a fixed number of 3D Gaussians per object. The Gaussians are supervised by transforming them into canonical space with the predicted 6D poses and aligning them with pre-rendered canonical ground truth. Evaluated on standard indoor benchmarks, the approach achieves state-of-the-art performance in monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions.
📝 Abstract
Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.
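The canonical-space supervision step described in the abstract can be illustrated with a minimal sketch. It assumes the common NOCS-style convention in which an object's pose maps canonical coordinates to camera coordinates as x_cam = s · R · x_can + t. The abstract does not state the exact parameterization, so the scale factor `s` and the helper names `to_camera` and `to_canonical` are hypothetical. Only Gaussian centers are handled here; covariances and orientations would need the corresponding rotation as well, and the actual loss against the pre-rendered ground truth is not shown.

```python
import numpy as np


def to_camera(points_can, R, t, s=1.0):
    """Map canonical-space points (N, 3) into camera space.

    Assumed convention (not confirmed by the abstract):
    x_cam = s * R @ x_can + t, written here in row-vector form.
    """
    return s * points_can @ R.T + t


def to_canonical(points_cam, R, t, s=1.0):
    """Invert the assumed pose: x_can = R^T @ (x_cam - t) / s.

    For row vectors, (x - t) @ R is the transpose of R^T @ (x - t).
    """
    return (points_cam - t) @ R / s


# Toy pose: 90-degree rotation about the z-axis, a translation, and a scale.
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t = np.array([0.1, -0.2, 1.5])
s = 0.3

# Hypothetical Gaussian centers for one object, defined in canonical space.
rng = np.random.default_rng(0)
centers_can = rng.uniform(-0.5, 0.5, size=(8, 3))

# Place them in the camera frame, then bring them back with the same pose.
centers_cam = to_camera(centers_can, R, t, s)
recovered = to_canonical(centers_cam, R, t, s)

# With an exact pose, the round trip reproduces the canonical centers.
max_error = np.abs(recovered - centers_can).max()
```

Because the comparison happens in the canonical frame, one set of canonical ground truth per object can be reused across every image in which that object appears, which is consistent with the abstract's point about avoiding per-image Gaussian label generation.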