Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

career value

225K/year

🤖 AI Summary

This work addresses scene-level 3D object detection and semantic mapping from uncalibrated images—i.e., without prior camera pose estimates. We propose the first end-to-end object-centric framework that jointly optimizes camera poses, object trajectories, and a global semantic 3D map, using oriented 3D bounding boxes as fundamental primitives. To enable metric-scale reconstruction, we replace conventional 2D feature matching with an object-centric correspondence mechanism. Our sparse, parametric mapping paradigm ensures map size scales linearly with the number of objects. Joint optimization is achieved via object-centric bundle adjustment, bounding box reprojection refinement, and multi-view geometric constraints. On CA-1M and ScanNet++, our method significantly outperforms state-of-the-art point-cloud- and voxel-based approaches, achieving superior localization accuracy, higher-fidelity maps, reduced model complexity, and stronger generalization across scenes.

Technology Category

Application Category

📝 Abstract

We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM) operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses, object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves a general, object-centric representation which not only extends the work of Cubify Anything to full scenes but also allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.

Problem

Research questions and friction points this paper is trying to address.

Detecting 3D objects in un-posed indoor images

Estimating camera poses and object tracks without prior metrics

Creating semantic 3D object maps from sparse object-centric data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses object-centric matcher for 3D boxes

Estimates camera poses and object tracks

Optimizes global 3D boxes for map quality

🔎 Similar Papers

F3Loc: Fusion and Filtering for Floorplan Localization