MV-RAG: Retrieval Augmented Multiview Diffusion

📅 2025-08-22
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
To address the challenge of preserving 3D consistency and text alignment for rare or out-of-distribution (OOD) concepts in text-to-3D generation, this paper proposes a retrieval-augmented multi-view diffusion framework. The method retrieves semantically relevant images from a large-scale real-world 2D image corpus to condition multi-view synthesis; introduces a held-out view prediction objective to implicitly model 3D geometric consistency using only 2D data; and employs a hybrid training strategy that jointly optimizes structured multi-view synthesis and realistic image reconstruction. Compared to state-of-the-art text-to-3D, image-to-3D, and personalized 3D generation methods, the approach significantly improves 3D consistency, visual realism, and text fidelity, particularly in OOD scenarios, while maintaining competitive performance on standard benchmarks.
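As an illustration of the retrieval step described above, here is a minimal sketch: a text-prompt embedding is matched against a precomputed index of 2D image embeddings by cosine similarity, and the top-k images are returned as conditioning inputs. The encoder choice, the flat-index layout, and every name below are assumptions for illustration, not MV-RAG's actual implementation.

```python
import numpy as np

def retrieve_top_k(prompt_emb: np.ndarray,
                   image_embs: np.ndarray,
                   k: int = 4) -> np.ndarray:
    """Return indices of the k corpus images most similar to the prompt.

    prompt_emb: (d,) L2-normalized text embedding (e.g., from a CLIP-style encoder).
    image_embs: (N, d) L2-normalized embeddings of the 2D image corpus.
    """
    scores = image_embs @ prompt_emb        # cosine similarity, shape (N,)
    return np.argsort(-scores)[:k]          # indices of the top-k matches

# Toy usage with random embeddings standing in for a real corpus; the
# k retrieved images would then be encoded and injected as conditioning
# tokens into the multiview diffusion model.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 512))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = rng.normal(size=512)
query /= np.linalg.norm(query)
print(retrieve_top_k(query, corpus))        # e.g., four image indices
```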

📝 Abstract
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
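The held-out view prediction objective from the abstract can be made concrete with a short sketch: a set of retrieved 2D images is treated as views of the same concept, one view is withheld and noised, and the model is trained to denoise it conditioned on the remaining views, so 3D consistency is learned from 2D data alone. The `model(noisy, t, context)` interface and the epsilon-prediction loss are assumptions; the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

def held_out_view_loss(model, views, alphas_cumprod, t):
    """Held-out view prediction on a retrieved image set (sketch).

    model: assumed interface model(noisy_view, t, context) -> predicted noise
    views: (B, V, C, H, W) retrieved 2D images treated as a view set
    alphas_cumprod: (T,) cumulative noise schedule
    t: (B,) sampled diffusion timesteps
    """
    B, V = views.shape[:2]
    idx = torch.arange(B, device=views.device)
    h = torch.randint(0, V, (B,), device=views.device)      # held-out view per sample
    target = views[idx, h]                                  # (B, C, H, W)
    keep = torch.ones(B, V, dtype=torch.bool, device=views.device)
    keep[idx, h] = False
    context = views[keep].view(B, V - 1, *views.shape[2:])  # remaining views
    noise = torch.randn_like(target)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * target + (1 - a).sqrt() * noise      # forward diffusion step
    return F.mse_loss(model(noisy, t, context), noise)      # epsilon-prediction loss
```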
Problem

Research questions and friction points this paper is trying to address.

Addresses failure in generating out-of-domain 3D concepts from text
Improves 3D consistency and accuracy for rare visual concepts
Enhances text adherence and photorealism in text-to-3D generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieves relevant 2D images from a large in-the-wild 2D database
Conditions the multiview diffusion model on the retrieved images
Trains with a hybrid strategy bridging structured multiview data and real-world 2D image sets (sketched below)
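The hybrid strategy from the bullets above might look like the following training step, which alternates between (a) structured multiview batches with augmented conditioning views and (b) retrieved real 2D image sets scored with the held-out view loss sketched earlier. The 50/50 sampling ratio, the jitter augmentation, and the model interface are all assumptions, not the paper's recipe.

```python
import random
import torch
import torch.nn.functional as F

def augment_conditioning_views(views):
    """Hypothetical stand-in for augmentations that simulate retrieval
    variance; here just mild Gaussian jitter."""
    return views + 0.05 * torch.randn_like(views)

def hybrid_training_step(model, optimizer, mv_batch, retrieved_views,
                         alphas_cumprod, p_multiview=0.5):
    """One optimization step of the hybrid schedule (sketch)."""
    if random.random() < p_multiview:
        # (a) Structured multiview data: reconstruct the target views
        # given augmented conditioning views.
        cond = augment_conditioning_views(mv_batch["cond_views"])
        targets = mv_batch["target_views"]                  # (B, V, C, H, W)
        B = targets.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (B,), device=targets.device)
        noise = torch.randn_like(targets)
        a = alphas_cumprod[t].view(B, 1, 1, 1, 1)
        noisy = a.sqrt() * targets + (1 - a).sqrt() * noise
        loss = F.mse_loss(model(noisy, t, cond), noise)
    else:
        # (b) Retrieved real 2D image sets: held-out view prediction
        # (see the earlier held_out_view_loss sketch).
        B = retrieved_views.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (B,),
                          device=retrieved_views.device)
        loss = held_out_view_loss(model, retrieved_views, alphas_cumprod, t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```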
👥 Authors
Yosef Dayani, Hebrew University of Jerusalem
Omer Benishu, Hebrew University of Jerusalem
Sagie Benaim, Assistant Professor, Hebrew University of Jerusalem (Computer Vision, Machine Learning)