Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

📅 2026-05-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a central challenge in sparse-view 3D reconstruction: producing complete geometry that also stays accurately aligned with the input views. To this end, we propose Mix3R, a framework that combines feed-forward reconstruction with generative 3D priors through a Mixture-of-Transformers architecture. Generation proceeds in two stages. The first stage jointly produces sparse voxels, per-view point maps, and camera parameters; the second generates textured geometry. A key component is an overlap-based attention bias, computed from the first-stage voxels and point maps, which lets a pretrained textured geometry model place input textures on the generated shape without additional training. Experiments show that Mix3R aligns with the input views better than purely generative approaches while maintaining high geometric fidelity, and that it estimates camera poses more accurately than existing feed-forward methods.
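As a rough illustration of the Mixture-of-Transformers idea mentioned above, the sketch below keeps separate query/key/value projections for each branch but runs a single softmax attention over the concatenated tokens of both branches. It is a minimal single-head sketch in plain Python, not the authors' implementation: the function names, the weight layout, and the toy dimensions are all assumptions made for illustration.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def global_self_attention(branch_tokens, branch_weights):
    """Single-head attention shared across branches (illustrative only).

    branch_tokens:  one list of token vectors per branch.
    branch_weights: one dict per branch with "q", "k", "v" matrices,
                    so each branch keeps its own projection parameters.
    Returns one list of output vectors per branch, in input order.
    """
    q, k, v, owner = [], [], [], []
    for b, (tokens, W) in enumerate(zip(branch_tokens, branch_weights)):
        for t in tokens:
            q.append(matvec(W["q"], t))
            k.append(matvec(W["k"], t))
            v.append(matvec(W["v"], t))
            owner.append(b)  # remember which branch the token came from

    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = [[] for _ in branch_tokens]
    for qi, b in zip(q, owner):
        # Every token attends to the tokens of all branches.
        logits = [scale * sum(a * c for a, c in zip(qi, kj)) for kj in k]
        w = softmax(logits)
        mixed = [sum(wj * vj[d] for wj, vj in zip(w, v)) for d in range(len(v[0]))]
        outputs[b].append(mixed)
    return outputs

# Toy usage (hypothetical data): identity projections,
# 2 reconstruction-branch tokens and 3 generative-branch tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = [{"q": identity, "k": identity, "v": identity} for _ in range(2)]
recon_tokens = [[1.0, 0.0], [1.0, 0.0]]
gen_tokens = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
recon_out, gen_out = global_self_attention([recon_tokens, gen_tokens], weights)
```

In the toy call, each reconstruction token's output picks up a nonzero second component that can only come from the generative tokens. This is the kind of cross-branch exchange that global attention is meant to provide.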
📝 Abstract
Recent trends in sparse-view 3D reconstruction have taken two different paths: feed-forward reconstruction that predicts pixel-aligned point maps without a complete geometry, and generative 3D reconstruction that generates complete geometry but often with poor input alignment. We present Mix3R, a novel generative 3D reconstruction method which mixes feed-forward reconstruction and 3D generation into a single framework in an aligned manner. Mix3R generates a 3D shape in two stages: a sparse voxel generation stage and a textured geometry generation stage. Unlike pure generative methods, our first-stage generation jointly produces a coarse 3D structure (sparse voxels), per-view point maps and camera parameters aligned to that 3D structure. This is made possible by introducing a Mixture-of-Transformers architecture that inserts global self-attention layers into a feed-forward reconstruction model and a 3D generative model, both pretrained on large-scale data. This design effectively retains the pretrained priors while enabling better 2D-3D alignment. Based on the initial aligned generations of sparse 3D voxels and point maps, we compute an overlap-based attention bias that is directly added to another pretrained textured geometry generation model, enabling it to correctly place input textures onto generated shapes in a training-free manner. Our design brings mutual benefits to both feed-forward reconstruction and 3D generation: the feed-forward branch learns to ground its predictions to a generative 3D prior, and conversely, the 3D generation branch is conditioned on geometrically informative features from the feed-forward branch. As a result, our method produces 3D shapes with better input alignment compared with pure 3D generative methods, together with camera pose estimates that are more accurate than those of previous feed-forward reconstruction methods. Our project page is at https://jsnln.github.io/mix3r/
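The abstract does not give the formula for the overlap-based attention bias, so the following is only one plausible reading, sketched in plain Python under stated assumptions. Each image token carries the 3D points from its patch of the point map, the bias between a voxel and an image token is proportional to the fraction of those points that fall inside the voxel, and the bias is added to the attention logits before the softmax. The names `overlap_bias`, `voxel_size`, and `strength` are hypothetical and do not come from the paper.

```python
import math

def overlap_bias(voxel_centers, voxel_size, token_points, strength=4.0):
    """Assumed bias: strength * fraction of a token's 3D points inside a voxel.

    voxel_centers: list of (x, y, z) centers of axis-aligned cubic voxels.
    token_points:  for each image token, a non-empty list of (x, y, z)
                   points taken from that token's patch of the point map.
    Returns bias[i][j] for voxel i and image token j.
    """
    half = voxel_size / 2.0
    bias = []
    for c in voxel_centers:
        row = []
        for pts in token_points:
            inside = sum(
                1 for p in pts
                if all(abs(p[k] - c[k]) <= half for k in range(3))
            )
            row.append(strength * inside / len(pts))
        bias.append(row)
    return bias

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(logits, bias):
    """Add the bias to the raw attention logits, then normalize each row."""
    return [
        softmax([l + b for l, b in zip(logit_row, bias_row)])
        for logit_row, bias_row in zip(logits, bias)
    ]

# Toy usage (hypothetical data): two unit voxels, two image tokens.
voxel_centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
token_points = [
    [(0.1, 0.0, 0.0), (0.2, 0.1, -0.1)],  # points near voxel 0
    [(0.9, 0.1, 0.0), (1.2, 0.0, 0.0)],   # points near voxel 1
]
bias = overlap_bias(voxel_centers, 1.0, token_points)
# Uniform logits stand in for what a pretrained model would produce.
uniform_logits = [[0.0, 0.0], [0.0, 0.0]]
attn = biased_attention_weights(uniform_logits, bias)
```

Because the bias only shifts the logits of an existing attention layer, a scheme of this kind needs no new learned parameters, which is consistent with the training-free claim in the abstract. In the toy call, each voxel ends up attending mostly to the image token whose points it contains.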
Problem

Research questions and friction points this paper is trying to address.

sparse-view 3D reconstruction
feed-forward reconstruction
generative 3D priors
multi-view alignment
pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Transformers
feed-forward reconstruction
generative 3D priors
multi-view alignment
training-free texture placement