🤖 AI Summary
Virtual try-on must achieve precise structural alignment and high-fidelity texture preservation at the same time, and current methods struggle to deliver both. This work proposes LPH-VTON, a diffusion-based collaborative generation framework that temporally decouples structural and textural modeling within a single continuous denoising process: a structure-biased model first constructs a geometrically consistent latent scaffold, which is then handed over to a texture-biased model that renders photorealistic details. A latent-space handover mechanism keeps information coherent across the transition between the structure-dominant and texture-dominant stages. Evaluated on the VITON-HD dataset, the approach sets a new state of the art in perceptual faithfulness while remaining highly competitive in structural alignment, a Pareto-optimal balance between the two objectives.
📝 Abstract
Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person's body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases inherent in prevailing architectures: models that rely heavily on spatial constraints naturally favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors excel at rendering vibrant detail but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a new synergistic framework that resolves this tension within a single, continuous denoising process. LPH-VTON decomposes generation over time: a structure-biased model establishes a geometrically consistent latent scaffold in the early stages, then hands over control to a texture-biased model for high-fidelity detail rendering. Extensive experiments on the standard VITON-HD dataset validate our approach. Our model achieves a Pareto-optimal balance, setting new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment, which supports the efficacy of temporal architectural decoupling.
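The handover idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `handover_denoise`, the `handover_fraction` parameter, and the two toy step functions are hypothetical names introduced here, and the sketch assumes both models operate in the same latent space so the latent can be passed across unchanged. The abstract does not state where the switch occurs or how the handover is performed, so the default of 0.5 is an arbitrary placeholder.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
# A denoising step maps (latent, timestep) to an updated latent.
Denoiser = Callable[[Latent, int], Latent]


def handover_denoise(
    latent: Sequence[float],
    structure_step: Denoiser,
    texture_step: Denoiser,
    num_steps: int,
    handover_fraction: float = 0.5,  # assumed value, not from the paper
) -> Tuple[Latent, List[str]]:
    """Run one continuous denoising loop that switches models partway through.

    Returns the final latent and a per-step log of which model was used.
    """
    if not 0.0 <= handover_fraction <= 1.0:
        raise ValueError("handover_fraction must lie in [0, 1]")
    # Number of early (high-noise) steps assigned to the structure-biased model.
    handover_index = round(num_steps * handover_fraction)
    x = list(latent)
    schedule: List[str] = []
    # Timesteps count down from num_steps - 1 (noisiest) to 0 (cleanest).
    for i, t in enumerate(range(num_steps - 1, -1, -1)):
        if i < handover_index:
            x = structure_step(x, t)  # early stage: build the latent scaffold
            schedule.append("structure")
        else:
            x = texture_step(x, t)  # late stage: render detail on the scaffold
            schedule.append("texture")
    return x, schedule


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_structure_step(x: Latent, t: int) -> Latent:
    # Halve every value, mimicking a step that removes coarse noise.
    return [0.5 * v for v in x]


def toy_texture_step(x: Latent, t: int) -> Latent:
    # Add a small constant, mimicking a step that adds fine detail.
    return [v + 0.01 for v in x]
```

Calling `handover_denoise([1.0, -2.0], toy_structure_step, toy_texture_step, num_steps=10, handover_fraction=0.4)` runs the first four steps with the structure stand-in and the remaining six with the texture stand-in. In a real system each step function would wrap a diffusion model and its sampler update, and the handover point would be a tuned hyperparameter.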