E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the competing challenges in zero-shot object image customization, namely high computational cost, parameter redundancy, background misalignment, and degraded foreground detail fidelity, this paper proposes a lightweight Masked Diffusion Transformer architecture operating directly in the autoencoder's latent space. The authors introduce a decoupled conditional design with a learnable conditional aggregator that explicitly separates background-alignment and foreground-detail representations, significantly reducing the dimensionality of the conditioning signal. Evaluated on the VITON-HD benchmark, the method achieves state-of-the-art performance across all major metrics: PSNR, FID, SSIM, and LPIPS. It reduces model parameters by 75%, accelerates inference by 2.5×, and cuts GPU memory consumption to two-thirds of prior approaches, demonstrating a strong balance between efficiency and high-fidelity generation.

📝 Abstract
We propose E-MD3C ($\underline{E}$fficient $\underline{M}$asked $\underline{D}$iffusion Transformer with Disentangled $\underline{C}$onditions and $\underline{C}$ompact $\underline{C}$ollector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms existing approaches on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS, demonstrating clear advantages in parameters, memory efficiency, and inference speed. With only $\frac{1}{4}$ of the parameters, our Transformer-based 468M model delivers $2.5\times$ faster inference and uses $\frac{2}{3}$ of the GPU memory compared to a 1720M Unet-based latent diffusion model.
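The paper does not include reference code here, so the sketch below illustrates one plausible reading of the learnable Conditions Collector described in the abstract: attention-style pooling in which a small set of learnable query vectors consolidates many background-alignment and foreground-detail condition tokens into a compact representation for the denoiser. All names, shapes, and the pooling mechanism itself are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conditions_collector(cond_tokens, queries, w_k, w_v):
    """Attention-pool N condition tokens down to M compact tokens (M << N).

    cond_tokens: (N, d) concatenated background + foreground condition tokens
    queries:     (M, d) learnable query vectors (the "collector")
    w_k, w_v:    (d, d) learnable key/value projections
    """
    k = cond_tokens @ w_k                                   # (N, d)
    v = cond_tokens @ w_v                                   # (N, d)
    attn = softmax(queries @ k.T / np.sqrt(k.shape[-1]))    # (M, N)
    return attn @ v                                         # (M, d)

# Hypothetical sizes: 256 background tokens, 77 foreground tokens,
# collected into 16 compact condition tokens of width 64.
rng = np.random.default_rng(0)
d, n_bg, n_fg, m = 64, 256, 77, 16
bg = rng.standard_normal((n_bg, d))        # background-alignment tokens
fg = rng.standard_normal((n_fg, d))        # foreground-detail tokens
queries = rng.standard_normal((m, d))
w_k = rng.standard_normal((d, d)) / np.sqrt(d)
w_v = rng.standard_normal((d, d)) / np.sqrt(d)

compact = conditions_collector(np.concatenate([bg, fg]), queries, w_k, w_v)
print(compact.shape)  # (16, 64): 333 condition tokens collapsed to 16
```

Under this reading, the denoising transformer cross-attends only to the 16 pooled tokens rather than all 333 raw condition tokens, which is one way the reported reduction in conditional dimensionality and memory could arise.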
Problem

Research questions and friction points this paper is trying to address.

High computational cost and parameter redundancy of Unet-based diffusion models
Background misalignment and degraded foreground detail fidelity
Slow inference and high GPU memory usage in zero-shot object customization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight masked diffusion transformers
Disentangled condition design
Learnable Conditions Collector