🤖 AI Summary
Existing single-image 3D reconstruction methods suffer from limited generalizability to complex real-world scenes: scene-diverse approaches rely heavily on 3D supervision, while object-centric ones depend on image priors and lack scalability. This work introduces a “divide-and-conquer” paradigm: first estimating global depth and semantic layout from a single input image, then leveraging large-model priors for per-object refinement, and finally composing objects into a complete, geometrically consistent, and semantically aligned 3D scene. We propose the first hybrid reconstruction framework that jointly optimizes scene-level and object-level representations, featuring a fully modular architecture enabling plug-and-play component replacement without end-to-end training or fine-tuning. Evaluated on both synthetic and real indoor datasets, our method significantly outperforms state-of-the-art approaches, achieving the first successful single-image 3D reconstruction of cluttered, multi-object, geometrically coherent, and semantically grounded real-world scenes.
📝 Abstract
Single-view 3D reconstruction is currently approached from two dominant perspectives: reconstruction of scenes with limited diversity using 3D data supervision, or reconstruction of diverse singular objects using large image priors. However, real-world scenarios are far more complex and exceed the capabilities of these methods. We therefore propose a hybrid method following a divide-and-conquer strategy. We first process the scene holistically, extracting depth and semantic information, and then leverage a single-shot object-level method for the detailed reconstruction of individual components. By following a compositional processing approach, the overall framework achieves full reconstruction of complex 3D scenes from a single image. We purposely design our pipeline to be highly modular by carefully integrating specific procedures for each processing step, without requiring end-to-end training of the whole system. This enables the pipeline to naturally improve as future methods replace the individual modules. We demonstrate the reconstruction performance of our approach on both synthetic and real-world scenes, comparing favorably against prior work. Project page: https://andreeadogaru.github.io/Gen3DSR.
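The plug-and-play modularity described above can be sketched as a pipeline whose stages are injected callables, so any stage can be swapped without retraining the rest. This is a minimal illustrative sketch; the stage names and interfaces are assumptions for exposition, not the paper's actual components or APIs.

```python
from typing import Any, Callable, List

class SceneReconstructor:
    """Minimal sketch of a divide-and-conquer single-image 3D pipeline.

    Each stage is an injected callable (hypothetical interfaces), so a
    module can be replaced without end-to-end retraining of the system.
    """

    def __init__(
        self,
        estimate_depth: Callable[[Any], Any],
        segment_objects: Callable[[Any], List[Any]],
        reconstruct_object: Callable[[Any], Any],
        compose_scene: Callable[[List[Any], Any], Any],
    ):
        self.estimate_depth = estimate_depth
        self.segment_objects = segment_objects
        self.reconstruct_object = reconstruct_object
        self.compose_scene = compose_scene

    def run(self, image: Any) -> Any:
        depth = self.estimate_depth(image)       # holistic scene depth ("divide")
        crops = self.segment_objects(image)      # semantic decomposition into objects
        objects = [self.reconstruct_object(c) for c in crops]  # per-object ("conquer")
        return self.compose_scene(objects, depth)  # recombine into one consistent scene
```

Because the stages only communicate through their inputs and outputs, upgrading, say, the depth estimator is a one-line change at construction time.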