🤖 AI Summary
This work addresses low fidelity and semantic distortion in e-commerce product image background replacement. We propose an end-to-end recontextualization framework built on text-to-image diffusion models. Methodologically, we introduce the first integrated data synthesis pipeline combining image-to-video diffusion, inpainting/outpainting, and negative-sample augmentation, alongside a product representation disentanglement mechanism that jointly optimizes structural consistency and attribute fidelity. Experiments on the ABO dataset and a proprietary e-commerce dataset demonstrate substantial improvements: FID decreases by 32%, CLIP-Score increases by 18%, and human evaluations show 41% and 53% gains in realism and product consistency, respectively, outperforming state-of-the-art methods. Our core contributions are: (1) the first controllable generation paradigm tailored to product recontextualization; (2) a disentangled product representation learning mechanism; and (3) a multi-stage synthesis strategy that jointly ensures photorealism and semantic consistency.
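To make the shape of such a multi-stage pipeline concrete, here is a minimal sketch using the Hugging Face diffusers library. The checkpoints (Stable Video Diffusion, Stable Diffusion inpainting), the step ordering, and the interpretation of "negative-sample augmentation" as repainting the product region itself are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an image-to-video + inpainting/outpainting +
# negative-sample synthesis pipeline; checkpoints and ordering are assumed.
import torch
from PIL import Image, ImageOps
from diffusers import StableVideoDiffusionPipeline, StableDiffusionInpaintPipeline

device = "cuda"

# Stage 1: image-to-video diffusion turns one product photo into many views.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to(device)

# Stage 2: inpainting/outpainting places each view into a new background.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)

def synthesize(product_img: Image.Image, product_mask: Image.Image, scene_prompt: str):
    """Generate (image, label) training pairs for one product."""
    frames = i2v(product_img.resize((1024, 576)), num_frames=8).frames[0]
    samples = []
    for frame in frames:
        frame = frame.resize((512, 512))
        mask = product_mask.resize((512, 512)).convert("L")
        # Repaint everything *outside* the product mask with the target scene
        # (the inpainting pipeline repaints white mask pixels).
        background_mask = ImageOps.invert(mask)
        positive = inpaint(prompt=scene_prompt, image=frame,
                           mask_image=background_mask).images[0]
        samples.append((positive, 1))
        # Stage 3 (assumed negative augmentation): repaint *inside* the mask
        # so the product itself is altered -- a hard negative for fidelity.
        negative = inpaint(prompt=scene_prompt, image=frame,
                           mask_image=mask).images[0]
        samples.append((negative, 0))
    return samples
```

The design intent this sketch tries to capture: the video model supplies viewpoint diversity, the inpainting model supplies background diversity, and corrupted-product negatives give the downstream model a supervision signal for attribute fidelity.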
📝 Abstract
We present a framework for high-fidelity product image recontextualization using text-to-image diffusion models and a novel data augmentation pipeline. This pipeline leverages image-to-video diffusion, inpainting/outpainting, and negative samples to create synthetic training data, addressing the limitations of real-world data collection for this task. Our method improves the quality and diversity of generated images by disentangling product representations and enhancing the model's understanding of product characteristics. Evaluation on the ABO dataset and a private product dataset, using automated metrics and human assessment, demonstrates the effectiveness of our framework in generating realistic and compelling product visualizations, with implications for applications such as e-commerce and virtual product showcasing.
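The automated metrics named above (FID and CLIP-Score) are standard and can be computed with torchmetrics; the sketch below shows a typical evaluation loop. The feature dimension, CLIP backbone, and preprocessing are assumptions, as the exact protocol is not given here.

```python
# Sketch of the automated evaluation loop; backbone and settings are assumed.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def evaluate(real_images: torch.Tensor, fake_images: torch.Tensor, prompts: list[str]):
    """All images are uint8 tensors of shape [N, 3, H, W]."""
    # FID: realism of generations, measured against real product photos.
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    # CLIP-Score: agreement between each generated image and its scene prompt.
    clip_score.update(fake_images, prompts)
    return {"fid": fid.compute().item(), "clip_score": clip_score.compute().item()}
```

Lower FID indicates generated images are statistically closer to real ones; higher CLIP-Score indicates better image-text alignment, which is how the reported 32% FID reduction and 18% CLIP-Score gain would be read.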