🤖 AI Summary
To address the competing challenges in zero-shot object image customization—high computational cost, parameter redundancy, background misalignment, and degraded foreground detail fidelity—this paper proposes a lightweight Masked Diffusion Transformer architecture operating directly in the autoencoder's latent space. It introduces a decoupled conditional design with a learnable conditions collector that explicitly separates background-alignment and foreground-detail representations, significantly reducing the conditional dimensionality. Evaluated on the VITON-HD benchmark, the method achieves state-of-the-art performance across all major metrics: PSNR, FID, SSIM, and LPIPS. It reduces model parameters by 75%, accelerates inference by 2.5×, and cuts GPU memory consumption to two-thirds of that of prior approaches—demonstrating a strong balance between efficiency and high-fidelity generation.
📝 Abstract
We propose E-MD3C ($\underline{E}$fficient $\underline{M}$asked $\underline{D}$iffusion Transformer with Disentangled $\underline{C}$onditions and $\underline{C}$ompact $\underline{C}$ollector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms existing approaches on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS, demonstrating clear advantages in parameters, memory efficiency, and inference speed. With only $\frac{1}{4}$ of the parameters, our Transformer-based 468M model delivers $2.5\times$ faster inference and uses $\frac{2}{3}$ of the GPU memory compared to a 1720M Unet-based latent diffusion model.
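The abstract does not give implementation details, but the "Conditions Collector" idea—consolidating multiple condition inputs into a compact representation—can be sketched as attention-style pooling, where a small set of learnable query vectors attends over the full set of condition tokens. The sketch below is a minimal, dependency-free illustration of that general mechanism; the function names, dimensions, and the pooling scheme itself are assumptions for illustration, not the authors' actual design:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def collect(queries, cond_tokens):
    """Attention-style pooling (illustrative stand-in for a conditions
    collector): each learnable query attends over all condition tokens
    and emits one compact output vector, so the denoiser sees
    len(queries) vectors instead of len(cond_tokens)."""
    out = []
    dim = len(cond_tokens[0])
    for q in queries:
        weights = softmax([dot(q, t) for t in cond_tokens])
        pooled = [sum(w * t[i] for w, t in zip(weights, cond_tokens))
                  for i in range(dim)]
        out.append(pooled)
    return out

# Hypothetical condition tokens (e.g. background-alignment and
# foreground-detail features) pooled into a much smaller set.
cond = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.8], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0]]  # 2 learnable queries -> 2 outputs
compact = collect(queries, cond)
```

The point of such a design is that the denoising transformer's sequence length no longer grows with the number or resolution of the conditions, which is one plausible source of the reported parameter and memory savings.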