Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image

📅 2025-11-03

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

To address geometric inconsistency, texture distortion, and low inference efficiency in single-image 3D reconstruction, this paper proposes an efficient and high-fidelity textured mesh generation framework. First, we design a cross-domain diffusion model based on Score Distillation Sampling (SDS) to jointly synthesize multi-view normal maps and RGB images. Second, we introduce a multi-view cross-domain attention mechanism to explicitly enforce geometric-appearance consistency across views. Third, we develop a cascaded 3D mesh extraction algorithm that achieves fine-grained surface reconstruction in approximately three minutes. Experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in reconstruction fidelity, geometric detail preservation, and inference speed—achieving up to 2.1× faster runtime while improving Chamfer distance by 28.7% and F-Score by 15.3% on the Objaverse benchmark. Moreover, the framework exhibits strong generalization across diverse object categories and real-world scenes, underscoring its practical applicability.

Technology Category

Application Category

📝 Abstract

In this work, we introduce extbf{Wonder3D++}, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that drives high-quality surfaces from the multi-view 2D representations in only about $3$ minute in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.

Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity 3D textured meshes from single images

Overcoming time-consuming optimization and inconsistent geometry issues

Ensuring multi-view consistency and efficiency in 3D reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multi-view normal maps and color images

Employs cross-domain attention for view consistency

Uses coarse-to-fine mesh extraction for efficiency

🔎 Similar Papers

No similar papers found.