🤖 AI Summary
Existing methods for high-fidelity, customizable indoor scene texture generation often lack fine-grained, instance-level control and suffer from low visual quality, artifacts, and baked-in lighting. To address these limitations, this work proposes CustomTex, a framework that takes multiple reference images as input and introduces a dual distillation mechanism, operating at both the semantic and pixel levels, within a Variational Score Distillation (VSD) optimization framework. By incorporating instance-aware cross-attention, CustomTex aligns each reference image with its corresponding 3D scene instance. According to the authors, this improves instance-wise texture consistency and sharpness, reduces entanglement with illumination, and suppresses artifacts, while enabling high-quality, user-friendly appearance customization at the object-instance level.
📝 Abstract
The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision needed for fine-grained, instance-level control, and often produce textures with low quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.
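The abstract does not spell out how the instance cross-attention is implemented. As a rough illustration of the general idea of "reference-instance" alignment, the sketch below shows a generic masked cross-attention in which each query (for example, a rendered pixel or patch feature) may attend only to reference tokens that carry the same instance ID. The function name `instance_masked_cross_attention`, the per-token instance IDs, and the choice to return zeros for instances without a reference are all assumptions made for this example; this is not the authors' implementation. It is written in dependency-free Python for clarity rather than speed.

```python
import math


def instance_masked_cross_attention(queries, keys, values, query_ids, key_ids):
    """Hypothetical sketch: scaled dot-product cross-attention with an instance mask.

    queries   : list of query feature vectors (e.g. rendered-view features)
    keys      : list of key vectors from reference-image tokens
    values    : list of value vectors from reference-image tokens
    query_ids : instance ID of each query
    key_ids   : instance ID of the reference each key/value token came from
    Returns (outputs, weights); weights[i][j] is the attention of query i on key j.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    outputs, all_weights = [], []

    for query, query_id in zip(queries, query_ids):
        # Only reference tokens of the same instance are visible to this query.
        allowed = [j for j, key_id in enumerate(key_ids) if key_id == query_id]
        weights = [0.0] * len(keys)

        if allowed:
            scores = [
                scale * sum(a * b for a, b in zip(query, keys[j])) for j in allowed
            ]
            # Numerically stable softmax over the allowed tokens only.
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            for j, e in zip(allowed, exps):
                weights[j] = e / total
        # Assumption: a query whose instance has no reference gets a zero output.

        output = [
            sum(w * v[c] for w, v in zip(weights, values)) for c in range(value_dim)
        ]
        outputs.append(output)
        all_weights.append(weights)

    return outputs, all_weights


# Toy usage: two instances (0 and 1) with reference tokens, one (2) without.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_ids = [0, 1, 2]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
key_ids = [0, 0, 1]
outputs, weights = instance_masked_cross_attention(
    queries, keys, values, query_ids, key_ids
)
```

In the toy usage, the query for instance 0 distributes its attention over the first two reference tokens only, the query for instance 1 attends solely to the third token, and the query for instance 2 receives no reference information. In a real pipeline the mask would more likely be applied inside a diffusion model's attention layers, which this sketch does not attempt to reproduce.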