🤖 AI Summary
Addressing the practical constraint that consumer-grade cameras lack depth sensors and typically capture only multi-view RGB images, this paper proposes TUN3D, the first end-to-end framework for joint indoor scene layout estimation and 3D object detection that requires neither ground-truth camera poses nor depth supervision. Methodologically, the authors pair a lightweight sparse-convolutional backbone with two dedicated heads (one for 3D object detection, one for layout estimation) and introduce a novel parametric wall representation that improves geometric modeling accuracy. Evaluated in three benchmark settings (ground-truth point clouds, posed images, and unposed images), the approach achieves state-of-the-art layout estimation while matching the 3D detection accuracy of specialized point-cloud-based methods. To the authors' knowledge, this is the first work to build compact, semantically rich 3D spatial representations from pure RGB inputs without pose priors.
📝 Abstract
Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
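The abstract highlights a parametric wall representation but does not spell out its form. One common way to parameterize a vertical wall segment is by a 2D start point, an in-plane heading angle, a length, and a height; the sketch below shows how such parameters map to 3D corner points. This parameterization and the function name `wall_corners` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def wall_corners(x0, y0, angle, length, height):
    """Convert an assumed parametric wall (2D start point, in-plane heading
    angle in radians, length, height) into its four 3D corner points,
    ordered bottom-start, bottom-end, top-end, top-start.

    NOTE: this is a hypothetical parameterization for illustration only;
    TUN3D's actual wall representation may differ.
    """
    # Unit direction of the wall in the floor plane.
    dx, dy = math.cos(angle), math.sin(angle)
    # End point of the wall segment.
    x1, y1 = x0 + length * dx, y0 + length * dy
    return [
        (x0, y0, 0.0),     # bottom corner at the start point
        (x1, y1, 0.0),     # bottom corner at the end point
        (x1, y1, height),  # top corner at the end point
        (x0, y0, height),  # top corner at the start point
    ]

# Example: a 4 m wall along the x-axis, 2.5 m tall.
corners = wall_corners(0.0, 0.0, 0.0, 4.0, 2.5)
```

A compact parameterization like this keeps the regression target low-dimensional (five numbers per wall instead of four free 3D corners), which is one plausible reason such a representation helps a detection-style layout head.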