🤖 AI Summary
This work addresses the challenge of automatically generating physically plausible, multi-channel PBR (Physically Based Rendering) materials from text descriptions and 3D geometry. To this end, it introduces the first video diffusion transformer capable of jointly synthesizing multiple material attributes (base color, roughness, metallic, and height maps) and pairs it with a compact variational autoencoder that maps the multimodal material data into a unified latent space, keeping computational overhead in check. The proposed method produces high-quality, tool-compatible PBR materials that preserve physical realism while significantly accelerating the 3D asset creation pipeline.
📝 Abstract
We present a method for generating physically based materials for 3D shapes, built on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallic, height map) to form physically plausible materials. We further introduce a custom variational autoencoder that encodes multiple material modalities into a compact latent space, enabling joint generation of all modalities without increasing the token count. Given a text prompt, our pipeline generates high-quality materials for 3D shapes that are compatible with common content creation tools.
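The key idea behind the compact latent space can be sketched as follows: because the PBR modalities share the same spatial resolution, they can be packed along the channel axis and encoded as a single latent grid, so the transformer's token count depends only on the spatial size, not on how many modalities are generated. The snippet below is a minimal illustration under assumed shapes; `toy_encode` is a hypothetical stand-in (average pooling plus a random channel projection), not the paper's actual VAE.

```python
import numpy as np

# Assumed toy resolution; the real model's resolutions are not specified here.
H = W = 64
base_color = np.random.rand(H, W, 3)  # RGB
roughness = np.random.rand(H, W, 1)
metallic = np.random.rand(H, W, 1)
height = np.random.rand(H, W, 1)

# Pack all modalities along the channel axis: 3+1+1+1 = 6 channels, same H x W.
material = np.concatenate([base_color, roughness, metallic, height], axis=-1)


def toy_encode(x, factor=8, latent_dim=4):
    """Hypothetical stand-in for the VAE encoder: average-pool the spatial
    dims by `factor`, then project channels to `latent_dim`."""
    h, w, c = x.shape
    pooled = x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    proj = np.random.default_rng(0).standard_normal((c, latent_dim))
    return pooled @ proj


latent = toy_encode(material)
# Transformer token count is (H/8) * (W/8) = 64, regardless of how many
# material channels were packed in.
print(material.shape, latent.shape)  # (64, 64, 6) (8, 8, 4)
```

Encoding the stacked channels jointly (rather than one latent per modality) is what lets the diffusion transformer model correlations between, e.g., base color and roughness without any extra tokens.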