🤖 AI Summary
Existing single-image 3D reconstruction methods suffer from limited generalizability to complex real-world scenes: scene-diverse approaches rely heavily on 3D supervision, while object-centric ones depend on image priors and lack scalability. This work introduces a “divide-and-conquer” paradigm: first estimating global depth and semantic layout from a single input image, then leveraging large-model priors for per-object refinement, and finally composing objects into a complete, geometrically consistent, and semantically aligned 3D scene. We propose the first hybrid reconstruction framework that jointly optimizes scene-level and object-level representations, featuring a fully modular architecture enabling plug-and-play component replacement without end-to-end training or fine-tuning. Evaluated on both synthetic and real indoor datasets, our method significantly outperforms state-of-the-art approaches, achieving the first successful single-image 3D reconstruction of cluttered, multi-object, geometrically coherent, and semantically grounded real-world scenes.
📝 Abstract
Single-view 3D reconstruction is currently approached from two dominant perspectives: reconstruction of scenes with limited diversity using 3D data supervision, or reconstruction of diverse singular objects using large image priors. However, real-world scenarios are far more complex and exceed the capabilities of these methods. We therefore propose a hybrid method following a divide-and-conquer strategy. We first process the scene holistically, extracting depth and semantic information, and then leverage a single-shot object-level method for the detailed reconstruction of individual components. By following a compositional processing approach, the overall framework achieves full reconstruction of complex 3D scenes from a single image. We purposely design our pipeline to be highly modular by carefully integrating specific procedures for each processing step, without requiring end-to-end training of the whole system. This enables the pipeline to naturally improve as future methods replace the individual modules. We demonstrate the reconstruction performance of our approach on both synthetic and real-world scenes, comparing favorably against prior work. Project page: https://andreeadogaru.github.io/Gen3DSR.
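The plug-and-play modularity described above can be sketched as a pipeline whose stages are injected callables, so any stage can be swapped without retraining the rest. This is a minimal illustrative sketch; the stage names and interfaces are assumptions for exposition, not the paper's actual components or APIs.

```python
from typing import Any, Callable, List

class SceneReconstructor:
    """Minimal sketch of a divide-and-conquer single-image 3D pipeline.

    Each stage is an injected callable (hypothetical interfaces), so a
    module can be replaced without end-to-end retraining of the system.
    """

    def __init__(
        self,
        estimate_depth: Callable[[Any], Any],
        segment_objects: Callable[[Any], List[Any]],
        reconstruct_object: Callable[[Any], Any],
        compose_scene: Callable[[List[Any], Any], Any],
    ):
        self.estimate_depth = estimate_depth
        self.segment_objects = segment_objects
        self.reconstruct_object = reconstruct_object
        self.compose_scene = compose_scene

    def run(self, image: Any) -> Any:
        depth = self.estimate_depth(image)       # holistic scene depth ("divide")
        crops = self.segment_objects(image)      # semantic decomposition into objects
        objects = [self.reconstruct_object(c) for c in crops]  # per-object ("conquer")
        return self.compose_scene(objects, depth)  # recombine into one consistent scene
```

Because the stages only communicate through their inputs and outputs, upgrading, say, the depth estimator is a one-line change at construction time.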