🤖 AI Summary
This work addresses the limitations of existing diffusion-based material transfer methods, which often rely on textual guidance or complex auxiliary networks, resulting in high computational costs and difficulties in feature alignment. To overcome these challenges, the authors propose MaTe, a lightweight diffusion framework that achieves high-quality, zero-shot, and training-free material transfer without requiring text prompts or reference networks. Built upon a diffusion Transformer architecture, MaTe performs token-level image fusion in a shared latent space through a multimodal attention mechanism, deliberately avoiding redundant components such as adapters, ControlNet, inversion sampling, or model fine-tuning. Experimental results demonstrate that MaTe surpasses state-of-the-art methods in visual quality, detail alignment accuracy, and inference efficiency, significantly simplifying deployment requirements.
📝 Abstract
Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with auxiliary networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, and it significantly simplifies the prerequisites for inference.
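To make the idea of token-level integration concrete, the following is a minimal sketch of joint attention over a concatenated token sequence. It is not the authors' implementation: the function `joint_attention`, the single-head formulation, the random projection matrices, and the token counts and dimension are all assumptions chosen for illustration. It shows only the general mechanism in which tokens from a target image and a material exemplar share one sequence, so every target token can attend to material tokens without a separate reference network.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(target_tokens, material_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated token sequence.

    Hypothetical illustration: both token sets are assumed to already live
    in the same latent space with the same feature dimension.
    """
    # Concatenate along the sequence axis: target tokens first, then material.
    tokens = target_tokens + material_tokens
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)

    # Scaled dot-product scores between every pair of tokens.
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]

    # Each output token is a weighted mix of all value vectors.
    fused = matmul(weights, v)

    # Keep only the rows that correspond to the target image tokens.
    return fused[: len(target_tokens)], weights


def random_matrix(rows, cols, rng):
    """Fill a rows x cols matrix with standard normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8                                  # assumed latent feature dimension
target = random_matrix(4, dim, rng)      # 4 stand-in target-image tokens
material = random_matrix(6, dim, rng)    # 6 stand-in material-exemplar tokens
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

fused_target, weights = joint_attention(target, material, w_q, w_k, w_v)
print(len(fused_target), len(fused_target[0]))
```

In this sketch the only coupling between the two images is the shared attention matrix, which is one plausible reading of how a design of this kind can avoid adapters or a ControlNet-style branch. A real diffusion Transformer would use learned weights, multiple heads, positional information, and many stacked blocks.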