DEFT-VTON: Efficient Virtual Try-On with Consistent Generalised H-Transform

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and resource demands of end-to-end fine-tuning large models for virtual try-on (VTO), this paper proposes an efficient diffusion model fine-tuning framework. Methodologically, it integrates the generalized Doob's *h*-transform, parameter-efficient fine-tuning (PEFT), and adaptive consistency distillation, updating only 1.42% of parameters while freezing the backbone. A lightweight architectural design enables rapid 15-step denoising. Experiments demonstrate that the method achieves state-of-the-art (SOTA) generation quality while reducing trainable parameters by nearly 4× compared to a traditional PEFT baseline (1.42% vs. 5.52% of backbone parameters). It significantly improves training stability and inference efficiency, and exhibits strong deployment feasibility.

📝 Abstract
Diffusion models enable high-quality virtual try-on (VTO) thanks to their established image synthesis abilities. While current VTO methods rely on extensive end-to-end training of large pre-trained models, real-world applications often operate under limited training, inference, serving, and deployment budgets. To address this obstacle, we apply Doob's h-transform efficient fine-tuning (DEFT) to adapt large pre-trained unconditional models for downstream image-conditioned VTO. DEFT freezes the pre-trained model's parameters and trains a small h-transform network to learn a conditional h-transform. The h-transform network requires training only 1.42% of the frozen backbone's parameter count, compared to 5.52% for a traditional parameter-efficient fine-tuning (PEFT) baseline. To further improve DEFT's performance and reduce inference time, we additionally propose an adaptive consistency loss. Consistency training distills a slow but high-performing diffusion model into a fast one while retaining performance by enforcing consistency along the inference path. Inspired by constrained optimization, instead of distillation we combine the consistency loss and the denoising score matching loss in a data-adaptive manner, fine-tuning existing VTO models at a low cost. Empirical results show the proposed DEFT-VTON method achieves state-of-the-art performance on VTO tasks while maintaining competitive results with as few as 15 denoising steps.
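The core parameterization described in the abstract, a frozen unconditional backbone plus a small trainable h-transform network whose output is added to the backbone's score, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy MLPs stand in for the actual (unspecified) U-Net backbone and h-network, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real networks; the paper's architectures are not
# specified here, so these small MLPs are purely illustrative.
class ToyBackbone(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

class ToyHNet(nn.Module):
    def __init__(self, dim=8, cond_dim=4):
        super().__init__()
        self.net = nn.Linear(dim + 1 + cond_dim, dim)

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

class DEFTScore(nn.Module):
    """Conditional score = frozen unconditional score + learned h-correction."""
    def __init__(self, backbone, h_net):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)   # freeze the pre-trained model
        self.h_net = h_net            # only the h-network is trained

    def forward(self, x_t, t, cond):
        with torch.no_grad():
            s_uncond = self.backbone(x_t, t)
        # Doob's h-transform adds a conditional correction to the score
        return s_uncond + self.h_net(x_t, t, cond)
```

Because only `h_net` carries gradients, the trainable parameter count is the h-network's size alone, which is how the 1.42% figure relative to the frozen backbone arises in the paper's setting.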
Problem

Research questions and friction points this paper is trying to address.

Efficient virtual try-on with limited training budgets
Reducing inference time while maintaining performance
Adapting pre-trained models for image-conditioned tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Doob's h-transform for efficient fine-tuning
Freezes pre-trained model, trains small h-network
Combines consistency loss with denoising score matching
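The abstract frames the last point via constrained optimization, so one natural reading is a Lagrangian-style scheme: treat the consistency loss as a constraint and adapt its weight per batch by dual ascent. The sketch below is an assumption about that "data-adaptive" combination, not the paper's exact rule; `eps` and `lr_dual` are hypothetical hyperparameters.

```python
def adaptive_total_loss(dsm_loss, consistency_loss, lam, eps=0.1, lr_dual=0.01):
    """Combine denoising score matching (DSM) and consistency losses.

    Hypothetical reading of the paper's constrained-optimization framing:
    enforce `consistency_loss <= eps` via a Lagrange multiplier `lam`,
    updated by dual ascent, so the consistency term's weight grows when
    the constraint is violated and shrinks (toward 0) when satisfied.
    """
    total = dsm_loss + lam * consistency_loss
    lam_next = max(0.0, lam + lr_dual * (consistency_loss - eps))
    return total, lam_next
```

In a training loop, `lam_next` from one batch becomes `lam` for the next, making the weighting data-adaptive rather than a fixed hand-tuned coefficient.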