🤖 AI Summary
This work addresses three key challenges in text-guided 3D scene stylization: low stylization quality, cross-view inconsistency, and weak regional semantic control. We propose a novel method that jointly ensures global consistency and local controllability. Our approach leverages a 2D diffusion model, integrating depth-conditioned generation, mask-guided regional editing, and re-training of the 3D representation. Key contributions include: (1) a single-reference attention-sharing mechanism coupled with a multi-depth-map grid strategy that ties appearance to geometry and enhances cross-view stylistic coherence; and (2) a multi-region importance-weighted sliced Wasserstein distance loss, driven by semantic segmentation masks, for fine-grained, semantically aligned regional style mixing. Experiments demonstrate significant improvements over state-of-the-art methods in both qualitative realism and quantitative metrics (e.g., CLIP-Score, LPIPS), achieving, for the first time, cross-view consistency and precise semantic controllability without sacrificing visual fidelity.
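The paper does not include code here, so the following is a minimal PyTorch sketch of one plausible reading of the single-reference attention-sharing mechanism: each target view's queries attend to its own keys/values concatenated with those of one shared reference view, rather than sharing attention fully across all views. Function names and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def single_reference_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Self-attention for one target view, augmented with the keys/values
    of a single shared reference view (instead of fully shared attention
    across all views). Shapes: q/k/v are [batch, heads, tokens, dim];
    the reference tensors may have batch size 1 and are broadcast."""
    # Append the reference keys/values so every target view attends to
    # the same reference, aligning style across viewpoints without
    # entangling all views with each other.
    k = torch.cat([k_tgt, k_ref.expand(k_tgt.shape[0], -1, -1, -1)], dim=2)
    v = torch.cat([v_tgt, v_ref.expand(v_tgt.shape[0], -1, -1, -1)], dim=2)
    return F.scaled_dot_product_attention(q_tgt, k, v)
```

In a diffusion U-Net, a function like this would presumably replace the self-attention call in each transformer block, with the reference view's keys and values cached from its own denoising pass.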
📝 Abstract
Recent advances in text-driven 3D scene editing and stylization, which leverage the powerful capabilities of 2D generative models, have demonstrated promising results. However, challenges remain in ensuring high-quality stylization and view consistency simultaneously. Moreover, applying style consistently to different regions or objects in the scene with semantic correspondence remains difficult. To address these limitations, we introduce techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer. Our method achieves stylization by re-training an initial 3D representation on stylized multi-view 2D images of the source views; ensuring both style consistency and view consistency of these stylized multi-view images is therefore crucial. We achieve this by extending the style-aligned depth-conditioned view generation framework, replacing the fully shared attention mechanism with a single-reference attention-sharing mechanism that effectively aligns style across viewpoints. Additionally, inspired by recent 3D inpainting methods, we use a grid of multiple depth maps as a single-image reference to further strengthen view consistency among the stylized images. Finally, we propose a Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, which applies styles to distinct image regions using segmentation masks from off-the-shelf models. We demonstrate that this optional feature improves the faithfulness of style transfer and enables mixing different styles across distinct regions of the scene. Qualitative and quantitative evaluations demonstrate that our pipeline effectively improves the results of text-driven 3D stylization.
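To make the region-controlled objective concrete, here is a minimal PyTorch sketch of a multi-region importance-weighted sliced Wasserstein distance, assuming per-region boolean segmentation masks, pre-extracted feature maps (e.g., from a frozen VGG encoder), and scalar importance weights per region. All names, shapes, and the quantile-matching step are assumptions for illustration, not the authors' exact formulation.

```python
import torch

def sliced_wasserstein(x, y, n_proj=64):
    """Sliced Wasserstein distance between two feature sets.
    x: [Nx, C] stylized features, y: [Ny, C] style-reference features."""
    c = x.shape[1]
    # Random unit directions; project both feature sets onto each.
    proj = torch.randn(c, n_proj, device=x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)
    px, _ = (x @ proj).sort(dim=0)   # [Nx, n_proj], sorted per direction
    py, _ = (y @ proj).sort(dim=0)   # [Ny, n_proj]
    # Match quantiles by subsampling both to a common length.
    n = min(px.shape[0], py.shape[0])
    ix = torch.linspace(0, px.shape[0] - 1, n, device=x.device).long()
    iy = torch.linspace(0, py.shape[0] - 1, n, device=x.device).long()
    return ((px[ix] - py[iy]) ** 2).mean()

def multi_region_sw_loss(feat_out, feat_styles, masks, weights):
    """Importance-weighted sum of per-region sliced Wasserstein distances.
    feat_out: [HW, C] features of the stylized render;
    feat_styles: list of [M, C] style features, one per region;
    masks: list of [HW] boolean region masks; weights: per-region scalars."""
    loss = 0.0
    for feat_s, m, w in zip(feat_styles, masks, weights):
        # Restrict the distribution match to pixels inside this region.
        loss = loss + w * sliced_wasserstein(feat_out[m], feat_s)
    return loss
```

Under this reading, each region's feature distribution is matched only against its assigned style reference, which is what allows different styles to be mixed across regions while the importance weights control how strongly each region is stylized.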