AI Summary
This work addresses a central challenge in sparse-view 3D reconstruction: achieving geometric completeness and accurate alignment with the input observations at the same time. To this end, we propose Mix3R, a novel framework that integrates feed-forward reconstruction with generative 3D priors through a Mixture-of-Transformers architecture. Mix3R works in two stages: the first jointly generates sparse voxels, per-view point maps, and camera parameters, and the second generates textured geometry. A key component is an overlap-based attention bias, computed from the first-stage voxels and point maps, which enables training-free 2D-3D texture mapping onto the generated shape. The design benefits both branches: feed-forward predictions are grounded in a generative 3D prior, and generation is conditioned on geometrically informative feed-forward features. As a result, Mix3R achieves better input alignment than purely generative approaches and more accurate camera pose estimation than previous feed-forward methods.
Abstract
Recent work in sparse-view 3D reconstruction has taken two different paths: feed-forward reconstruction, which predicts pixel-aligned point maps but no complete geometry, and generative 3D reconstruction, which generates complete geometry but often with poor input alignment. We present Mix3R, a novel generative 3D reconstruction method that mixes feed-forward reconstruction and 3D generation into a single framework in an aligned manner. Mix3R generates a 3D shape in two stages: a sparse voxel generation stage and a textured geometry generation stage. Unlike purely generative methods, our first stage jointly produces a coarse 3D structure (sparse voxels) together with per-view point maps and camera parameters aligned to that structure. This is made possible by a Mixture-of-Transformers architecture that inserts global self-attention layers into a feed-forward reconstruction model and a 3D generative model, both pretrained on large-scale data. This design retains the pretrained priors while enabling better 2D-3D alignment. From the aligned first-stage sparse voxels and point maps, we compute an overlap-based attention bias that is added directly to another pretrained textured geometry generation model, enabling it to place input textures correctly onto generated shapes in a training-free manner. Our design benefits both feed-forward reconstruction and 3D generation: the feed-forward branch learns to ground its predictions in a generative 3D prior, and conversely, the 3D generation branch is conditioned on geometrically informative features from the feed-forward branch. As a result, our method produces 3D shapes with better input alignment than purely generative 3D methods, together with camera pose estimates that are more accurate than those of previous feed-forward reconstruction methods. Our project page is at https://jsnln.github.io/mix3r/
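The abstract does not state how the overlap-based attention bias is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a pixel "overlaps" a voxel when its predicted 3D point falls inside that voxel's cube, and that overlapping pairs receive a constant positive bias added to the attention logits before the softmax. The function names, the `strength` parameter, and the inside-the-cube test are all hypothetical choices made for illustration.

```python
import math

def overlap_bias(voxel_centers, voxel_size, points, strength=4.0):
    """Return bias[v][p] = strength if 3D point p lies inside voxel v, else 0.0.

    Assumption: overlap is a simple point-in-cube test; the paper may define it differently.
    """
    half = voxel_size / 2.0
    bias = []
    for center in voxel_centers:
        row = []
        for point in points:
            inside = all(abs(pi - ci) <= half for pi, ci in zip(point, center))
            row.append(strength if inside else 0.0)
        bias.append(row)
    return bias

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(queries, keys, values, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    No weights are modified, which is what makes the bias training-free.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q, bias_row in zip(queries, bias):
        logits = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) + b
            for k, b in zip(keys, bias_row)
        ]
        weights = softmax(logits)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

# Toy example: two voxel tokens (queries) attend to three pixel tokens (keys/values).
voxel_centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pixel_points = [(0.1, 0.2, -0.3), (0.9, 0.1, 0.0), (5.0, 5.0, 5.0)]
bias = overlap_bias(voxel_centers, 1.0, pixel_points)

queries = [[0.0, 0.0], [0.0, 0.0]]      # uninformative queries, so only the bias matters
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = biased_attention(queries, keys, values, bias)
```

In this toy setup the queries carry no information, so each voxel token's output is dominated by the pixel whose 3D point lies inside it. This illustrates how a bias of this kind could steer a pretrained attention layer toward geometrically consistent texture placement.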