🤖 AI Summary
Existing Structure-from-Motion (SfM) methods typically rely on a two-stage pipeline and exhibit limited robustness to missing data and unbounded scene coordinates. This work introduces the first end-to-end multi-view diffusion model that jointly estimates 3D structure and camera poses in a global coordinate system directly from input images. Scene geometry and camera poses are encoded as the origins and endpoints of pixel rays, and these structured ray parameters are adopted as the diffusion model's output variables for the first time. To enhance robustness against occlusion and scale variation, the authors introduce coordinate normalization, mask-aware training, and explicit uncertainty modeling. A Transformer-based denoising architecture enables effective multi-view feature fusion and global ray parameterization. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art performance in sparse reconstruction accuracy, cross-scene generalization, and uncertainty calibration.
📝 Abstract
Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.
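To make the ray parameterization concrete, the following is a minimal sketch of how a camera and a per-pixel depth map can be converted into pixel-wise ray origins and endpoints in a global frame. The function name `pixel_rays` and the convention that `(R, t)` is the world-from-camera transform are illustrative assumptions; the paper's exact parameterization and conventions may differ.

```python
import numpy as np

def pixel_rays(K, R, t, depth):
    """Map a camera (intrinsics K, world-from-camera rotation R, translation t)
    and a per-pixel depth map to pixel-wise ray origins and endpoints in the
    global frame. Hypothetical sketch, not the paper's implementation."""
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    # Back-project pixels to camera-frame points at the given depths.
    dirs_cam = np.linalg.inv(K) @ pix            # per-pixel viewing directions
    pts_cam = dirs_cam * depth.reshape(1, -1)    # endpoints in the camera frame
    # Transform endpoints into the global (world) frame.
    endpoints = (R @ pts_cam + t[:, None]).T.reshape(H, W, 3)
    # Every ray of this view shares the camera center as its origin.
    origins = np.broadcast_to(t, (H, W, 3)).copy()
    return origins, endpoints
```

Under this encoding, the diffusion model denoises the `(origins, endpoints)` tensors jointly across views, so camera pose (origins plus ray directions) and scene geometry (endpoints) live in one shared output space.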