🤖 AI Summary
Existing approaches to generating high-resolution, aligned multi-map SVBRDF textures struggle to keep the maps consistent with one another without training a new VAE or substantially modifying the DiT backbone. Method: The paper proposes HiMat, a memory- and computation-efficient diffusion framework for native 4K SVBRDF generation, built by finetuning a diffusion transformer (DiT) text-to-image model. Its core component, CrossStitch, is a lightweight convolutional module that captures dependencies among SVBRDF maps (e.g., albedo, normal, roughness) through localized operations. Its weights are initialized so that the DiT backbone's behavior is unchanged before finetuning begins, and no new VAE is trained. Contribution/Results: Results on a large set of text prompts show that HiMat generates 4K SVBRDFs with strong structural coherence across maps and high-frequency detail. Further experiments suggest that the approach generalizes to tasks such as intrinsic decomposition.
📝 Abstract
Creating highly detailed spatially varying bidirectional reflectance distribution functions (SVBRDFs) is essential for 3D content creation. The rise of high-resolution text-to-image generative models based on diffusion transformers (DiT) suggests an opportunity to finetune them for this task. However, retargeting these models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across the maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key problem we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.
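The abstract states that CrossStitch is a localized convolution across maps whose weights are initialized so that the backbone's output is unchanged before finetuning. The sketch below illustrates one way such a property could be obtained, and it is not the HiMat implementation. It assumes a residual branch with zero-initialized weights, a 3×3 kernel, and single-channel toy maps. The names `zeros_kernel` and `cross_map_conv` are hypothetical.

```python
# Illustrative sketch only, NOT the HiMat implementation.
# Assumption: "unchanged before finetuning" is achieved with a residual
# branch whose convolution weights start at zero.


def zeros_kernel(num_maps, k=3):
    """Weights w[out_map][in_map][dy][dx], all zero at initialization."""
    return [[[[0.0] * k for _ in range(k)] for _ in range(num_maps)]
            for _ in range(num_maps)]


def cross_map_conv(maps, w):
    """Residual k x k convolution that mixes information across maps.

    maps: list of grids, one per SVBRDF map (each H x W, one channel here).
    Returns maps + conv(maps), using zero padding at the borders.
    """
    num_maps = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    k = len(w[0][0])
    r = k // 2
    out = [[row[:] for row in m] for m in maps]  # residual path: copy of input
    for mo in range(num_maps):          # output map
        for mi in range(num_maps):      # input map contributing to it
            for y in range(height):
                for x in range(width):
                    acc = 0.0
                    for dy in range(k):
                        for dx in range(k):
                            yy, xx = y + dy - r, x + dx - r
                            if 0 <= yy < height and 0 <= xx < width:
                                acc += w[mo][mi][dy][dx] * maps[mi][yy][xx]
                    out[mo][y][x] += acc
    return out


# Three toy 2x2 grids standing in for albedo, normal and roughness features.
maps = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
w = zeros_kernel(num_maps=3)
identity_out = cross_map_conv(maps, w)  # equals the input while w is all zero

# Hypothetical weight after finetuning: map 1 -> map 0, centre tap.
w[0][1][1][1] = 0.5
mixed_out = cross_map_conv(maps, w)  # map 0 now depends on map 1
```

With all weights at zero the module is an exact identity, so the pretrained backbone's behavior is preserved at the start of finetuning. Once a weight becomes non-zero, each output map can draw on a small spatial neighborhood of the other maps, which is the kind of localized inter-map dependency the abstract describes.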