SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of preserving fine-grained garment details, such as text and patterns, in diffusion-based virtual try-on, where existing methods rely on implicit learning of spatial correspondences. To achieve precise geometric alignment, the authors propose the first integration of the classical SIFT feature matching algorithm into this domain. They match SIFT keypoints between garment and person images, apply domain-specific filtering to the matches, and convert the surviving correspondences into spatial probability distributions that supervise the cross-attention layers of the diffusion model during training. On the VITON-HD dataset, the method significantly improves unpaired evaluation metrics while remaining competitive on paired reconstruction metrics, and qualitative comparisons show better text legibility and pattern alignment.
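To make the matching step concrete, here is a minimal sketch of descriptor matching with Lowe's ratio test, a standard filter for SIFT matches. It is only an illustration: the summary does not specify the paper's domain-specific filtering, so the ratio test stands in for it here, and the function name `ratio_test_matches`, the threshold `ratio=0.75`, and the toy 3-dimensional descriptors are all assumptions (real SIFT descriptors are 128-dimensional).

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_a[i] matches desc_b[j] under Lowe's ratio test.

    Hypothetical helper for illustration; desc_b needs at least two rows.
    """
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        order = np.argsort(row)
        best, second = order[0], order[1]
        # Keep the match only if the best candidate is clearly better than the runner-up.
        if row[best] < ratio * row[second]:
            matches.append((i, int(best)))
    return matches

# Toy descriptors standing in for garment-side and person-side SIFT descriptors.
garment_desc = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]])
person_desc = np.array([[0.9, 0.1, 0.0], [0.0, 0.95, 0.05], [0.0, 0.0, 1.0]])

matches = ratio_test_matches(garment_desc, person_desc)
print(matches)
```

In this toy input the third garment descriptor lies almost equally close to two person descriptors, so the ratio test discards it as ambiguous and only the two unambiguous matches survive.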

📝 Abstract

Diffusion-based virtual try-on methods achieve photorealistic synthesis through cross-attention mechanisms that transfer garment features to target body regions. However, these approaches rely on implicit learning of spatial correspondences, struggling to preserve fine details such as text and illustrations. We propose a novel approach, which we call SIFT-VTON, that utilizes SIFT keypoint matching to provide explicit geometric guidance for diffusion-based virtual try-on. Our method applies domain-specific filtering to SIFT keypoint matches between garment and person images, then converts these correspondences into spatial probability distributions that supervise cross-attention layers during training. This explicit supervision guides the model to learn precise spatial alignment, concentrating attention on geometrically consistent garment regions. Experiments on the VITON-HD dataset demonstrate significant improvements on unpaired metrics while maintaining competitive paired reconstruction metrics. Qualitative comparisons show superior preservation of text clarity and pattern alignment. Attention visualizations confirm that our method produces sharply focused attention on relevant garment details. This work demonstrates that classical geometric correspondence methods can effectively enhance modern diffusion models for conditional synthesis tasks. The source code will be available at https://github.com/takesukeDS/SIFT-VTON.
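The step of converting a correspondence into a spatial probability distribution that supervises cross-attention can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the form of the distribution or the loss, so an isotropic Gaussian target and a cross-entropy loss are assumed, and the names `gaussian_target`, `attention_supervision_loss`, the 8 x 8 attention grid, and `sigma=1.0` are hypothetical.

```python
import numpy as np

def gaussian_target(center_xy, height, width, sigma=1.0):
    """Spatial probability distribution over an H x W grid, peaked at center_xy.

    Assumed form: an isotropic Gaussian around the matched garment keypoint.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    logits = -((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2)
    probs = np.exp(logits)
    return probs / probs.sum()  # normalize so the grid sums to 1

def attention_supervision_loss(attn, target, eps=1e-8):
    """Cross-entropy between the target distribution and one attention map (assumed loss)."""
    return float(-(target * np.log(attn + eps)).sum())

H, W = 8, 8
# Suppose a person-side query location was matched to garment keypoint (x=5, y=2).
target = gaussian_target((5, 2), H, W, sigma=1.0)

# Two candidate attention maps for that query: one focused on the match, one uniform.
focused = gaussian_target((5, 2), H, W, sigma=1.0)
uniform = np.full((H, W), 1.0 / (H * W))

loss_focused = attention_supervision_loss(focused, target)
loss_uniform = attention_supervision_loss(uniform, target)
print(loss_focused < loss_uniform)
```

Under these assumptions, an attention map concentrated on the matched garment location incurs a lower loss than a diffuse one, which is the behavior the supervision is meant to encourage.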

Problem

Research questions and friction points this paper is trying to address.

virtual try-on

diffusion models

geometric correspondence

cross-attention

detail preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

SIFT keypoint matching

geometric correspondence

cross-attention supervision