Matrix3D: Large Photogrammetry Model All-in-One

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the longstanding reliance on task-specific models for pose estimation, depth prediction, and novel-view synthesis in photogrammetry. To this end, the authors propose Matrix3D, a unified multimodal diffusion Transformer. Methodologically, they design a mask-driven, partially bimodal training scheme that accommodates heterogeneous, weakly paired data (e.g., image–pose or image–depth pairs), and introduce cross-modal feature alignment coupled with an interactive, fine-grained 3D control module. The key contribution is the first end-to-end joint optimization of all three tasks within a single model, substantially improving robustness and generalization when modalities are absent. On standard benchmarks, Matrix3D achieves state-of-the-art performance in both pose estimation and novel-view synthesis while significantly improving training-data efficiency. This work establishes a new paradigm for general-purpose 3D vision modeling.

📝 Abstract
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, within the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modal image–pose and image–depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.
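The mask learning strategy described above can be sketched in a few lines. This is a minimal illustration of the general idea, not the paper's implementation: each training sample carries only some modalities, one available modality is chosen as the denoising target, the rest of the available ones serve as conditions, and absent modalities are simply masked out of both roles. All function names and the mask representation here are illustrative assumptions.

```python
import numpy as np

# The three modalities Matrix3D operates over, per the abstract.
MODALITIES = ["image", "pose", "depth"]

def build_modality_mask(available, target):
    """Build per-modality condition/prediction masks for one training sample.

    available: set of modalities present in this (possibly partial) sample,
               e.g. {"image", "pose"} for a weakly paired image-pose sample.
    target:    the modality the model should predict at this step.
    Absent modalities are neither conditioned on nor supervised, which is
    what lets bi-modal data participate in full-modality training.
    """
    cond = {m: (m in available and m != target) for m in MODALITIES}
    pred = {m: (m == target) for m in MODALITIES}
    return cond, pred

def sample_training_step(sample, rng):
    """Pick a random prediction target among the sample's available
    modalities, so incomplete pairs still yield a valid training step."""
    available = {m for m in MODALITIES if sample.get(m) is not None}
    target = rng.choice(sorted(available))
    return build_modality_mask(available, target)

rng = np.random.default_rng(0)
# A weakly paired image-depth sample: no pose annotation exists.
sample = {"image": np.zeros((8, 8)), "pose": None, "depth": np.zeros((8, 8))}
cond, pred = sample_training_step(sample, rng)
assert not cond["pose"] and not pred["pose"]  # missing modality never used
```

In a real diffusion-transformer training loop these boolean masks would gate which modality tokens receive noise and which contribute to the loss; the sketch only shows the bookkeeping that makes partially paired data usable.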
Problem

Research questions and friction points this paper is trying to address.

Lack of a unified model covering multiple photogrammetry tasks
Integrating multi-modal transformations within a diffusion transformer
Training on incomplete, partially paired multi-modal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal diffusion transformer
Mask learning strategy
Full-modality model training