MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to ensure both geometric consistency across views and text-driven personalized customization: multi-view generation models offer camera-pose control but do not support customization, while customization-oriented models lack explicit viewpoint control and cannot guarantee cross-view geometric consistency. This paper introduces the novel task of "multi-view customization," unifying camera-pose control with text-prompt customization for the first time. Methodologically, we propose a diffusion-based feature-field framework incorporating depth-aware feature rendering and consistency-aware latent completion, further enhanced by a text-to-video backbone with dense spatio-temporal attention that jointly models identity and geometry. Experiments demonstrate that the approach is the first framework to simultaneously achieve high-fidelity multi-view synthesis, precise text-guided customization, and strong geometric consistency, maintaining superior visual quality and cross-view coherence across diverse textual prompts.
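The dense spatio-temporal attention mentioned above can be pictured as full self-attention over every token from every view at once, rather than factorized spatial-then-temporal passes. A minimal single-head, shape-level NumPy sketch (no learned projections; all names are illustrative, not the paper's actual layer):

```python
import numpy as np

def dense_spatiotemporal_attention(x):
    """Full self-attention across every (view, height, width) token.

    x : (V, H, W, C) per-view feature maps. Flattening all views into one
    token sequence lets every pixel attend to every pixel in every other
    view, which is how temporal coherence in a video backbone can serve
    as multi-view consistency. Single-head, no Q/K/V projections: this is
    a shape-level sketch only, not the actual backbone layer.
    """
    V, H, W, C = x.shape
    tokens = x.reshape(V * H * W, C)
    scores = tokens @ tokens.T / np.sqrt(C)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over all tokens
    return (attn @ tokens).reshape(V, H, W, C)
```

The cost is quadratic in V·H·W, which is why dense (rather than factorized) attention is usually applied at latent resolution.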

📝 Abstract
Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making the two difficult to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating a text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistency-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.
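The depth-aware feature rendering described in the abstract can be pictured as a geometric warp: each target-view pixel is unprojected with its depth, reprojected into the source view, and the source feature map is sampled there. A minimal NumPy sketch under an assumed shared pinhole intrinsics matrix `K` and a target-to-source rigid transform (all names hypothetical; the paper's actual rendering operates on diffusion latents via its feature field):

```python
import numpy as np

def warp_features(src_feat, tgt_depth, K, T_tgt2src):
    """Warp a source-view feature map into the target view using target depth.

    src_feat  : (H, W, C) source-view features (e.g. diffusion latents)
    tgt_depth : (H, W) per-pixel depth in the target view
    K         : (3, 3) shared pinhole intrinsics
    T_tgt2src : (4, 4) rigid transform from target camera to source camera
    Returns (H, W, C) warped features and an (H, W) visibility mask.
    """
    H, W, C = src_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T

    # Unproject target pixels to 3D points in the target camera frame.
    pts_tgt = np.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)
    pts_tgt_h = np.vstack([pts_tgt, np.ones((1, H * W))])

    # Move the points into the source camera frame and project them.
    pts_src = (T_tgt2src @ pts_tgt_h)[:3]
    proj = K @ pts_src
    u = proj[0] / np.clip(proj[2], 1e-6, None)
    v = proj[1] / np.clip(proj[2], 1e-6, None)

    # Nearest-neighbour sampling; pixels projecting outside the source view
    # (or behind the camera) are marked invisible for later completion.
    ui, vi = np.round(u).astype(int), np.round(v).astype(int)
    valid = (pts_src[2] > 0) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)

    warped = np.zeros((H * W, C), dtype=src_feat.dtype)
    warped[valid] = src_feat[vi[valid], ui[valid]]
    return warped.reshape(H, W, C), valid.reshape(H, W)
```

The returned visibility mask marks exactly the holes that the latent-completion step must fill: regions the target camera sees but the source view never observed.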
Problem

Research questions and friction points this paper is trying to address.

Achieving geometric consistency in multi-view customized diffusion models
Unifying camera pose control with prompt-based subject customization
Maintaining customization fidelity under limited training data constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses feature-field representation for subject geometry learning
Employs depth-aware rendering for geometric consistency enforcement
Implements latent completion for perspective alignment accuracy
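The latent-completion idea in the last bullet can be sketched as masked blending during denoising: the depth-warped regions are re-imposed after every diffusion step, so the model only fills the holes the warp left behind. A simplified RePaint-style stand-in (the `denoise_step` callable is hypothetical; the paper's consistency-aware completion additionally enforces perspective alignment of the background, which a per-pixel blend does not capture):

```python
import numpy as np

def complete_latent(rendered, mask, denoise_step, steps=50, seed=0):
    """Fill holes in a depth-rendered latent with model predictions.

    rendered     : (H, W, C) warped latent, valid only where mask is True
    mask         : (H, W) True where the warp produced a value
    denoise_step : callable(latent, t) -> latent, one step of a diffusion
                   sampler (hypothetical stand-in for the actual backbone)
    The rendered regions are re-imposed after every step, so the model is
    only free to synthesize content inside the holes.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(rendered.shape)
    m = mask[..., None].astype(rendered.dtype)
    for t in range(steps, 0, -1):
        latent = denoise_step(latent, t)
        latent = m * rendered + (1.0 - m) * latent  # keep visible regions fixed
    return latent
```

Because the visible regions are pinned at every step, the completed holes stay consistent with the geometry the warp already established.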