🤖 AI Summary
Existing 3D style transfer methods, particularly NeRF-based approaches, struggle to simultaneously preserve fine-grained texture detail and maintain multi-view consistency, often yielding viewpoint-dependent color and texture distortions. This paper proposes MM-NeRF, a unified framework for multimodal-guided, multi-view-consistent 3D multi-style transfer. It projects text and image inputs into a shared multimodal embedding space, designs a multi-head learning scheme that decouples styles in the implicit network, and introduces a cross-view style consistency loss alongside an incremental mechanism for adapting to novel styles at low cost. By combining multimodal alignment with cross-view regularization, the method mitigates the view-dependent inconsistencies introduced by 2D stylized supervision and achieves high-fidelity, photorealistic rendering. Extensive evaluation on real-world datasets demonstrates superior detail preservation, strong generalization to unseen styles, and effective low-cost novel-style transfer. The framework supports hybrid guidance via both text and reference images, enabling flexible and controllable 3D stylization.
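The shared text/image embedding space is described only at a high level. As a rough illustration of the idea, the sketch below projects features from each modality into a common unit-norm space (in the spirit of CLIP-style alignment) so that either modality can drive stylization; the function names, learned projection matrices `W_text`/`W_image`, and feature shapes are all hypothetical, not the paper's actual architecture.

```python
import numpy as np

def normalize(x):
    """L2-normalize feature vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def project_to_shared_space(text_feat, image_feat, W_text, W_image):
    """Hypothetical projection of text and image style features into a
    single unit-norm embedding space, so either modality can condition
    the stylization network interchangeably."""
    t = normalize(text_feat @ W_text)
    i = normalize(image_feat @ W_image)
    return t, i

def style_alignment(t, i):
    """Cosine similarity between the two modality embeddings; a
    contrastive objective would push matching text/image style pairs
    toward 1 and mismatched pairs toward lower values."""
    return float(np.sum(t * i))
```

With identity projections and identical input features, the two embeddings coincide and the alignment score is 1; training would instead learn `W_text` and `W_image` so that paired guidance aligns.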
📝 Abstract
3D style transfer aims to generate stylized views of 3D scenes in specified styles, which requires both high-quality generation and multi-view consistency. Existing methods still struggle with high-quality stylization of texture details and with stylization under multimodal guidance. In this paper, we reveal that the common training method for stylization with NeRF, which generates stylized multi-view supervision via 2D style transfer models, causes the same object to appear in varying states (color tone, details, etc.) across views in the supervision. This leads NeRF to smooth out texture details, resulting in low-quality rendering for 3D multi-style transfer. To tackle these problems, we propose a novel Multimodal-guided 3D Multi-style transfer of NeRF, termed MM-NeRF. First, MM-NeRF projects multimodal guidance into a unified space to keep style consistency across modalities and extracts multimodal features to guide the 3D stylization. Second, a novel multi-head learning scheme is proposed to ease the difficulty of learning multi-style transfer, and a multi-view style consistency loss is proposed to tackle the inconsistency of the multi-view supervision data. Finally, a novel incremental learning mechanism is proposed to generalize MM-NeRF to any new style at low cost. Extensive experiments on several real-world datasets show that MM-NeRF achieves high-quality 3D multi-style stylization with multimodal guidance, while maintaining multi-view consistency and style consistency with the multimodal guidance.
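The abstract does not give the multi-view style consistency loss in closed form. As a minimal sketch of the underlying idea, the numpy example below penalizes each stylized view for deviating from the cross-view average of simple per-channel color statistics (mean and standard deviation), used here as a cheap proxy for the "state" (color tone) of the supervision; the function name and exact formulation are assumptions for illustration, not the paper's loss.

```python
import numpy as np

def multiview_style_consistency_loss(views):
    """Hypothetical cross-view style consistency penalty.

    views: array of shape (V, H, W, 3), V stylized renderings of the
    same scene from different viewpoints. Per-channel mean/std act as
    a proxy for each view's style state; the loss is the mean squared
    deviation of each view's statistics from the cross-view average.
    """
    views = np.asarray(views, dtype=np.float64)
    # Per-view, per-channel style statistics: shape (V, 3) each.
    means = views.mean(axis=(1, 2))
    stds = views.std(axis=(1, 2))
    # Shared reference statistics averaged over views: shape (3,).
    mean_ref = means.mean(axis=0)
    std_ref = stds.mean(axis=0)
    # Zero when all views share the same color statistics.
    return float(((means - mean_ref) ** 2).mean()
                 + ((stds - std_ref) ** 2).mean())
```

Identical views incur zero loss, while views whose color tone drifts apart are penalized, which is the failure mode the abstract attributes to 2D-generated supervision.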