MessyKitchens: Contact-rich object-level 3D scene reconstruction

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of object-level 3D reconstruction in complex, cluttered scenes from monocular images, where object diversity, frequent occlusions, and the difficulty of modeling physical contacts pose significant obstacles. To this end, the authors introduce MessyKitchens, the first real-world dataset of cluttered kitchen scenes featuring high-fidelity annotations of object-to-object contact. Building upon the SAM 3D framework, they propose a Multi-Object Decoder (MOD) that enables end-to-end joint reconstruction of individual objects' 3D shapes, poses, and physical contact relationships. Evaluated on MessyKitchens and three additional benchmarks, the method substantially outperforms existing approaches, with notable gains in registration accuracy and adherence to non-penetration constraints, thereby advancing physically plausible object-level 3D scene reconstruction.

📝 Abstract
Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we show that MessyKitchens significantly improves on previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.
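The abstract evaluates reconstructions by inter-object penetration, i.e. how much reconstructed objects overlap in violation of physical plausibility. As a rough illustration of what such a metric measures (this is a hypothetical sketch using axis-aligned bounding boxes, not the paper's actual mesh-based evaluation), one can compute the overlap volume of two objects' bounding boxes; a physically plausible scene would drive this toward zero:

```python
import numpy as np

def aabb_overlap_volume(min_a, max_a, min_b, max_b):
    """Overlap volume of two axis-aligned bounding boxes.

    A crude proxy for inter-object penetration: 0.0 means the boxes
    are disjoint (no penetration possible at this coarse level);
    a positive value upper-bounds the shared volume of the objects.
    Real evaluations would intersect the actual meshes instead.
    """
    lo = np.maximum(min_a, min_b)          # lower corner of the intersection
    hi = np.minimum(max_a, max_b)          # upper corner of the intersection
    extent = np.clip(hi - lo, 0.0, None)   # clamp to zero when boxes are disjoint
    return float(np.prod(extent))

# Two unit cubes shifted by 0.5 along x overlap in a 0.5 x 1 x 1 slab.
v = aabb_overlap_volume(np.zeros(3), np.ones(3),
                        np.array([0.5, 0.0, 0.0]), np.array([1.5, 1.0, 1.0]))
# v == 0.5
```

A full penetration metric would intersect the reconstructed meshes (e.g. via a collision library) and aggregate over all object pairs in the scene; the bounding-box version above merely shows the shape of the computation.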
Problem

Research questions and friction points this paper is trying to address.

3D scene reconstruction
object-level decomposition
physical plausibility
contact-rich scenes
monocular reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

object-level reconstruction
physically-plausible scenes
Multi-Object Decoder (MOD)
contact-rich 3D scenes
MessyKitchens dataset