CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model

📅 2026-03-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods struggle to generate shadows for multiple composited objects that are geometrically plausible, properly attached, and spatially consistent. This work addresses that gap with a collaborative shadow generation framework built on a pretrained text-to-image diffusion model. An image pathway injects multi-scale spatial features, while a text pathway encodes each object's shadow bounding box as positional tokens. These tokens are fused via cross-attention, and an attention alignment loss grounds them to their corresponding shadow regions. Evaluated on both single- and multi-object shadow generation, the method is reported to achieve state-of-the-art performance.
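The attention alignment idea can be illustrated with a small sketch. This is not the paper's loss, whose exact form is not given here. It assumes a simple stand-in: for one positional token, the loss is one minus the fraction of that token's cross-attention mass that falls inside its shadow bounding box. The names `box_mask` and `alignment_loss` and the box convention are hypothetical.

```python
from typing import List, Tuple

# Assumed box convention: (x0, y0, x1, y1) in attention-map cells,
# with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def box_mask(height: int, width: int, box: Box) -> List[List[float]]:
    """Binary mask that is 1.0 inside the box and 0.0 elsewhere."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]


def alignment_loss(attn: List[List[float]], box: Box, eps: float = 1e-8) -> float:
    """Illustrative loss: 1 minus the share of attention inside the box.

    `attn` is a non-negative cross-attention map for a single token.
    The loss approaches 0 when all attention lies inside the box and
    approaches 1 when none does.
    """
    h, w = len(attn), len(attn[0])
    mask = box_mask(h, w, box)
    total = sum(sum(row) for row in attn)
    inside = sum(attn[y][x] * mask[y][x] for y in range(h) for x in range(w))
    return 1.0 - inside / (total + eps)


# Uniform attention over a 4x4 map, box covering the top-left 2x2 cells.
uniform = [[1.0 / 16.0] * 4 for _ in range(4)]
loss_uniform = alignment_loss(uniform, (0, 0, 2, 2))

# Attention concentrated entirely inside the same box.
focused = [[0.25, 0.25, 0.0, 0.0], [0.25, 0.25, 0.0, 0.0],
           [0.0] * 4, [0.0] * 4]
loss_focused = alignment_loss(focused, (0, 0, 2, 2))
```

Minimizing a term of this kind pushes each token's attention toward its own box, which is one plausible way to ground a positional token to its shadow region.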

📝 Abstract

Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single- and multi-object shadow generation settings.
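The automatic prompt derivation described above, which combines object category with shadow position, might look roughly like the following. The abstract does not specify the prompt template or how boxes become tokens, and it says the positional tokens are learned. This sketch substitutes a simple, assumed scheme: normalized box coordinates are quantized into bins and written as hypothetical `<loc_k>` tokens. The function names, the template, and the bin count are all illustrative assumptions.

```python
from typing import List, Tuple

# Assumed box convention: normalized (x0, y0, x1, y1), each in [0, 1].
Box = Tuple[float, float, float, float]


def quantize(value: float, bins: int) -> int:
    """Map a coordinate in [0, 1] to a bin index in [0, bins - 1]."""
    value = min(max(value, 0.0), 1.0)
    return min(int(value * bins), bins - 1)


def box_tokens(box: Box, bins: int = 100) -> List[str]:
    """Hypothetical positional tokens, one per box coordinate."""
    return [f"<loc_{quantize(v, bins)}>" for v in box]


def build_prompt(objects: List[Tuple[str, Box]], bins: int = 100) -> str:
    """Join one clause per object: its category plus its shadow-box tokens."""
    clauses = []
    for category, box in objects:
        tokens = " ".join(box_tokens(box, bins))
        clauses.append(f"shadow of {category} at {tokens}")
    return "; ".join(clauses)


# Two inserted objects with made-up shadow boxes, using 4 bins for readability.
prompt = build_prompt(
    [("car", (0.0, 0.25, 0.5, 1.0)), ("person", (0.5, 0.5, 0.75, 0.75))],
    bins=4,
)
```

In a learned variant, each `<loc_k>` string would correspond to a trainable embedding in the text encoder's vocabulary rather than a fixed text fragment.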

Problem

Research questions and friction points this paper is trying to address.

multi-object shadow generation

image compositing

shadow consistency

realistic shadow synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-object shadow generation

diffusion model

cross-attention

positional tokens