🤖 AI Summary
To address poor object-background consistency (e.g., spatial layout, shadows, reflections) and insufficient precision of text-only control in e-commerce image background inpainting, this paper proposes the first multimodal diffusion-based generation framework integrating both textual prompts and reference images. Methodologically, we design a dual-conditioning control mechanism that jointly leverages a text encoder and reference-image feature injection to co-model background spatial structure, illumination, and material properties. We further construct DreamEcom-400K, a high-quality e-commerce inpainting dataset comprising 400K samples. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in object-background consistency, visual naturalness, and style fidelity. To our knowledge, this is the first work to achieve high-fidelity, jointly text- and vision-driven e-commerce background generation, effectively enabling automated e-commerce image synthesis.
📝 Abstract
Although diffusion-based image genenation has been widely explored and applied, background generation tasks in e-commerce scenarios still face significant challenges. The first challenge is to ensure that the generated products are consistent with the given product inputs while maintaining a reasonable spatial arrangement, harmonious shadows, and reflections between foreground products and backgrounds. Existing inpainting methods fail to address this due to the lack of domain-specific data. The second challenge involves the limitation of relying solely on text prompts for image control, as effective integrating visual information to achieve precise control in inpainting tasks remains underexplored. To address these challenges, we introduce DreamEcom-400K, a high-quality e-commerce dataset containing accurate product instance masks, background reference images, text prompts, and aesthetically pleasing product images. Based on this dataset, we propose DreamPainter, a novel framework that not only utilizes text prompts for control but also flexibly incorporates reference image information as an additional control signal. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, maintaining high product consistency while effectively integrating both text prompt and reference image information.