🤖 AI Summary
This work addresses two limitations of existing fashion outfit generation methods: they make inadequate use of multimodal conditioning, and they lack a large-scale, diverse dataset tailored to e-commerce scenarios. As a result, they struggle to generate visually consistent garments. The authors introduce Fashion130k, a new multimodal e-commerce dataset covering diverse occasions, models, and garment types, along with a Unified Multi-modal Condition (UMC) framework. UMC uses an embedding refiner, which contains a Fusion Transformer, to unify textual and visual prompts into a joint embedding space by adjusting the modality gap between them. It also redesigns the attention in the generation model so that the noise image can select the most relevant prompt tokens, enabling fine-grained diffusion-based generation. Experiments on real-world applications and benchmarks indicate that the approach achieves better visual consistency than state-of-the-art methods.
📝 Abstract
Recent work on fashion outfit generation focuses on improving the visual consistency of garments by leveraging key information from a reference image and a text prompt. However, the potential of outfit generation remains underexplored, as it requires a comprehensive e-commerce dataset and careful use of multi-modal conditions. In this paper, we propose a brand-new e-commerce dataset, named Fashion130k, covering various occasions, models, and garment types. For consistent garment generation, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract unified embeddings of the multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between the prompts and the noise image, so that the noise image can select the pivotal prompt tokens for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmarks demonstrate the effectiveness of UMC in visual consistency, achieving better results than those of SoTA methods.
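The two ideas in the abstract, narrowing the modality gap and letting noise-image tokens select prompt tokens through attention, can be illustrated with a toy sketch. This is not the authors' implementation: the centroid shift below is only a hypothetical stand-in for the learned Fusion Transformer, the attention is plain scaled dot-product cross-attention without learned projections, and all function names (`close_modality_gap`, `cross_attention`, `mean_vector`) are assumptions made for illustration.

```python
import math


def mean_vector(tokens):
    """Return the centroid of a list of equal-length token vectors."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def close_modality_gap(text_tokens, image_tokens, strength=1.0):
    """Shift text and image tokens toward each other's centroid.

    Hypothetical stand-in for the Fusion Transformer: with strength=1.0
    both modalities end up sharing the same centroid.
    """
    text_mean = mean_vector(text_tokens)
    image_mean = mean_vector(image_tokens)
    # Half of the vector pointing from the text centroid to the image centroid.
    half_gap = [strength * (i - t) / 2.0 for t, i in zip(text_mean, image_mean)]
    shifted_text = [[x + g for x, g in zip(tok, half_gap)] for tok in text_tokens]
    shifted_image = [[x - g for x, g in zip(tok, half_gap)] for tok in image_tokens]
    # The unified prompt is the concatenation of both aligned token sets.
    return shifted_text + shifted_image


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(noise_tokens, prompt_tokens):
    """Each noise token (query) attends over the unified prompt tokens.

    Keys and values are the prompt tokens themselves (no learned projections).
    Returns the attended outputs and the attention weights per query.
    """
    dim = len(prompt_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in noise_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in prompt_tokens]
        attn = softmax(scores)
        out = [sum(w * key[i] for w, key in zip(attn, prompt_tokens)) for i in range(dim)]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


# Toy usage: two text tokens, two image tokens, one noise-image token.
text_tokens = [[1.0, 0.0], [0.8, 0.2]]
image_tokens = [[0.0, 1.0], [0.2, 0.8]]
unified_prompt = close_modality_gap(text_tokens, image_tokens)
outputs, weights = cross_attention([[1.0, 0.0]], unified_prompt)
```

In this sketch, the attention weights show which prompt tokens a given noise token draws on most, which is the selection behavior the abstract describes. A real system would learn both the alignment and the attention projections.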