\textsc{Mobile-VTON}: High-Fidelity On-Device Virtual Try-On

📅 2026-03-01
🤖 AI Summary
This work proposes a fully offline, high-fidelity virtual try-on framework designed to operate entirely on edge devices using only a single user image and a garment image. Addressing the privacy concerns and deployment limitations of existing cloud-dependent systems that rely on GPU servers, the method introduces a modular architecture comprising TeacherNet, GarmentNet, and TryonNet. It leverages feature-guided adversarial distillation, trajectory consistency loss, and a lightweight cross-modal alignment mechanism to achieve high-quality image synthesis without requiring large-scale pretraining. Evaluated on the VITON-HD and DressCode datasets, the approach generates results at 1024×768 resolution, matching or surpassing server-grade baselines in visual quality while maintaining computational efficiency suitable for standard mobile hardware.
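The summary's mention of "latent concatenation and lightweight cross-modal conditioning" in TryonNet can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration only: the channel-stacking scheme, the `denoise` stub, and the token-averaging conditioning are not taken from the paper, which does not specify its implementation here.

```python
# Hedged, toy sketch of latent concatenation with lightweight cross-modal
# conditioning. Shapes, names, and the denoiser stub are illustrative
# assumptions, not the paper's actual architecture.

def concat_channels(person_latent, garment_latent):
    """Stack person and garment latents along the channel axis.
    Each latent is a list of channels; each channel is a flat list
    of spatial values, so list concatenation acts as a channel stack."""
    return person_latent + garment_latent

def denoise(latent, cond_tokens):
    """Stub for a lightweight cross-modally conditioned denoising step:
    each value is nudged 10% toward the mean of the conditioning tokens."""
    bias = sum(cond_tokens) / len(cond_tokens)
    return [[v - 0.1 * (v - bias) for v in ch] for ch in latent]

person = [[0.5, 0.5], [0.2, 0.2]]  # 2 channels, 2 spatial positions each
garment = [[0.9, 0.9]]             # 1 garment channel
tokens = [0.0, 1.0]                # toy cross-modal conditioning tokens

stacked = concat_channels(person, garment)  # 3 channels total
out = denoise(stacked, tokens)
```

The design point the summary emphasizes is that concatenation plus a small conditioning pathway avoids the heavy cross-attention stacks (and the large-scale pretraining they usually require) of server-grade diffusion try-on models.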

📝 Abstract
Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present \textsc{Mobile-VTON}, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. \textsc{Mobile-VTON} introduces a modular TeacherNet--GarmentNet--TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, \textsc{Mobile-VTON} achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at $1024{\times}768$ show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
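The two training objectives named in the abstract, Feature-Guided Adversarial (FGA) distillation and the trajectory-consistency loss, can be sketched in miniature. This is a toy reading under stated assumptions: the loss forms, the weight `w_adv`, and the non-saturating adversarial term are plausible choices, not the paper's published formulation.

```python
# Hedged sketch of the two objectives described in the abstract.
# All function names, weights, and feature shapes are illustrative
# assumptions; the paper does not specify its exact losses here.
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fga_distillation_loss(student_feat, teacher_feat, disc_score, w_adv=0.1):
    """FGA distillation as the abstract describes it: teacher feature
    supervision combined with an adversarial term pushing the student's
    outputs toward the real-image distribution. `disc_score` is the
    discriminator's probability that the student output is real."""
    distill = mse(student_feat, teacher_feat)       # match TeacherNet features
    adversarial = -math.log(max(disc_score, 1e-8))  # non-saturating GAN loss
    return distill + w_adv * adversarial

def trajectory_consistency_loss(garment_feats):
    """Penalize drift of GarmentNet features between consecutive
    diffusion steps, keeping garment semantics stable along the
    denoising trajectory."""
    return sum(mse(f0, f1) for f0, f1 in zip(garment_feats, garment_feats[1:]))

# Toy usage with 4-dim feature vectors:
s, t = [0.2, 0.4, 0.1, 0.0], [0.25, 0.35, 0.1, 0.05]
loss = fga_distillation_loss(s, t, disc_score=0.8)
traj = trajectory_consistency_loss([[0.1, 0.2], [0.1, 0.25], [0.12, 0.25]])
```

The combination is what lets the student stay small: the teacher term transfers fidelity from a larger model, while the adversarial term compensates for detail the distillation target alone would smooth away.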
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-On
On-Device Deployment
Privacy Preservation
Mobile Devices
Offline Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Device Virtual Try-On
Knowledge Distillation
Garment-Conditioned Generation
Trajectory-Consistency Loss
Cross-Modal Conditioning