🤖 AI Summary
Existing methods struggle to jointly deliver geometric consistency across views and text-driven personalized customization: multi-view generation models do not support customization, while customization-oriented models lack explicit camera-pose control. This paper introduces the novel task of “multi-view customization,” unifying camera-pose control with prompt-based customization. Methodologically, the authors propose MVCustom, a diffusion-based framework that learns the subject's identity and geometry as a feature field, built on a text-to-video backbone with dense spatio-temporal attention to leverage temporal coherence for multi-view consistency; at inference, depth-aware feature rendering enforces geometric consistency and consistency-aware latent completion aligns the perspective of the customized subject with its background. Experiments demonstrate that MVCustom is the first framework to simultaneously achieve high-fidelity multi-view synthesis, precise text-guided customization, and strong cross-view coherence across diverse textual prompts.
📝 Abstract
Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making the two challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of customization training data, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating a text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistency-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding background. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.
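The dense spatio-temporal attention the abstract credits with multi-view consistency is, at its core, a standard mechanism: instead of each frame attending only within itself, every latent token attends to every token across all frames (views), so identity and geometry information can propagate between viewpoints. A minimal NumPy sketch of that idea — all shapes, names, and the random projections standing in for learned weights are illustrative assumptions, not MVCustom's actual implementation:

```python
import numpy as np

def dense_spatiotemporal_attention(latents, seed=0):
    """latents: (T, H, W, C) grid of per-view latent tokens.

    Flattens space AND time into one token axis before self-attention,
    so tokens from different views attend to one another directly.
    """
    T, H, W, C = latents.shape
    tokens = latents.reshape(T * H * W, C)  # dense: all views in one sequence
    rng = np.random.default_rng(seed)
    # Illustrative random projections in place of learned Q/K/V weights.
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(C)                 # (THW, THW): cross-view scores
    scores -= scores.max(axis=-1, keepdims=True)  # softmax numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v).reshape(T, H, W, C)
```

Per-frame (spatial-only) attention would instead reshape to `(T, H * W, C)` and attend within each frame separately; the dense variant trades a quadratic cost in `T * H * W` for direct information flow across views.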