AlignVTOFF: Texture-Spatial Feature Alignment for High-Fidelity Virtual Try-Off

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the structural distortions and loss of high-frequency texture commonly observed in existing virtual try-off (VTOFF) methods when handling complex geometric deformations. To mitigate these issues, the authors propose AlignVTOFF, a parallel U-Net framework. It employs a Reference U-Net for multi-scale feature extraction and introduces a Texture-Spatial Feature Alignment (TSFA) mechanism that explicitly aligns garment texture and spatial cues. TSFA injects the reference garment features into a frozen denoising U-Net through a hybrid attention design that pairs a trainable cross-attention module with a frozen self-attention module. Experiments across multiple settings show that AlignVTOFF outperforms state-of-the-art methods, improving both structural realism and the fidelity of high-frequency details in the generated flat-lay garments.

📝 Abstract
Virtual Try-Off (VTOFF) is a challenging multimodal image generation task that aims to synthesize high-fidelity flat-lay garments under complex geometric deformation and rich high-frequency textures. Existing methods often rely on lightweight modules for fast feature extraction, which struggle to preserve structured patterns and fine-grained details, leading to texture attenuation during generation. To address these issues, we propose AlignVTOFF, a novel parallel U-Net framework built upon a Reference U-Net and Texture-Spatial Feature Alignment (TSFA). The Reference U-Net performs multi-scale feature extraction and enhances geometric fidelity, enabling robust modeling of deformation while retaining complex structured patterns. TSFA then injects the reference garment features into a frozen denoising U-Net via a hybrid attention design consisting of a trainable cross-attention module and a frozen self-attention module. This design explicitly aligns texture and spatial cues and alleviates the loss of high-frequency information during the denoising process. Extensive experiments across multiple settings demonstrate that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.
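The hybrid attention design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the two attention outputs are combined, so the residual sums below are an assumption, and every name here (`Projection`, `hybrid_attention`, the token shapes, the `trainable` flag) is hypothetical. The sketch only shows the data flow: denoising tokens attend to themselves through weights marked frozen, then query the reference garment tokens through weights marked trainable.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class Projection:
    """Hypothetical attention module: Q/K/V weights plus a trainable flag."""

    def __init__(self, dim, trainable, rng):
        self.trainable = trainable  # a real trainer would skip updates when False
        self.w_q = random_matrix(dim, dim, rng)
        self.w_k = random_matrix(dim, dim, rng)
        self.w_v = random_matrix(dim, dim, rng)

    def __call__(self, query_tokens, context_tokens):
        q = matmul(query_tokens, self.w_q)
        k = matmul(context_tokens, self.w_k)
        v = matmul(context_tokens, self.w_v)
        return attention(q, k, v)


def hybrid_attention(denoise_tokens, reference_tokens, self_attn, cross_attn):
    """Self-attention on denoising tokens, then cross-attention to reference tokens.

    The residual connections are an assumption, not taken from the paper.
    """
    hidden = add(denoise_tokens, self_attn(denoise_tokens, denoise_tokens))
    return add(hidden, cross_attn(hidden, reference_tokens))


rng = random.Random(0)
dim = 4
self_attn = Projection(dim, trainable=False, rng=rng)   # frozen, as in the abstract
cross_attn = Projection(dim, trainable=True, rng=rng)   # trainable, as in the abstract
denoise_tokens = random_matrix(6, dim, rng)     # 6 toy denoising U-Net tokens
reference_tokens = random_matrix(9, dim, rng)   # 9 toy Reference U-Net tokens
output = hybrid_attention(denoise_tokens, reference_tokens, self_attn, cross_attn)
trainable_parts = [
    name
    for name, module in (("self_attn", self_attn), ("cross_attn", cross_attn))
    if module.trainable
]
```

The output keeps the shape of the denoising tokens, so a block like this could in principle sit inside an existing denoising layer without changing the layers around it. Only the cross-attention weights are flagged for training, which mirrors the abstract's split between a frozen self-attention module and a trainable cross-attention module.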
Problem

Research questions and friction points this paper is trying to address.

Virtual Try-Off
Texture Preservation
Geometric Deformation
High-Fidelity Image Generation
Multimodal Image Synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Texture-Spatial Feature Alignment
Parallel U-Net
Hybrid Attention
High-Fidelity Virtual Try-Off
Reference U-Net
Yihan Zhu
Center for Electron Microscopy; College of Chemical Engineering, Zhejiang University of Technology
Electron microscopy, Catalysis, Energy storage and conversion, X-ray diffraction
Mengying Ge
National Demonstration Center for Experimental Engineering Training Education, Shanghai University, 99 Shangda Road, Baoshan District, Shanghai 200444, China