PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories

📅 2026-01-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing generative models often suffer from object-identity loss, layout distortion, and color inconsistency in multi-object image synthesis. This work fine-tunes a pretrained image-to-video (I2V) diffusion model, with textual conditioning, on synthetic trajectory data that simulates the smooth motion of objects from random initial positions to their target layout. By exploiting the temporal priors inherent in video diffusion models, the method strengthens object-identity consistency and spatial coherence; at inference, the final frame of the generated sequence serves as the composite image. In quantitative evaluations and user studies, the approach significantly outperforms current state-of-the-art methods, achieving higher object-retention rates, lower omission and duplication rates, and superior visual fidelity.
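The retention, omission, and duplication rates mentioned above can be illustrated with a toy calculation. This is a minimal sketch under assumptions: the counting scheme, metric names, and label-based matching here are illustrative stand-ins, not the paper's exact evaluation protocol.

```python
from collections import Counter

def compositing_metrics(expected, detected):
    """Toy per-image compositing metrics (illustrative definitions only).

    `expected` and `detected` are lists of object labels: the objects that
    should appear in the composite vs. those found in the output image.
    """
    exp, det = Counter(expected), Counter(detected)
    retained = sum(min(det[k], exp[k]) for k in exp)      # matched objects
    omitted = sum(max(exp[k] - det[k], 0) for k in exp)   # missing objects
    duplicated = sum(max(det[k] - exp[k], 0) for k in det) # extra copies
    n = sum(exp.values())
    return {
        "retention_rate": retained / n,
        "omission_rate": omitted / n,
        "duplication_rate": duplicated / n,
    }
```

For example, with expected objects `["cup", "vase", "book"]` and detections `["cup", "cup", "book"]`, the vase counts as omitted and the second cup as duplicated.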

πŸ“ Abstract
Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short of studio-level multi-object compositing. This task simultaneously demands (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) control over layout and design elements, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentation. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identity, and background detail by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences in which randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.
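The data-curation idea in the abstract — randomly placed objects moving smoothly to their target positions — can be sketched as follows. This is a hedged illustration, not the paper's pipeline: the linear interpolation, frame count, canvas size, and `make_trajectories` helper are all assumptions standing in for whatever trajectory model and renderer the authors actually use.

```python
import random

def make_trajectories(targets, num_frames=16, canvas=(512, 512), seed=0):
    """Sample a random start position for each target (x, y) and linearly
    interpolate toward it over num_frames frames (illustrative only).

    Returns a list of frames; each frame is a list of (x, y) object positions,
    so the last frame matches the target layout.
    """
    rng = random.Random(seed)
    starts = [(rng.uniform(0, canvas[0]), rng.uniform(0, canvas[1]))
              for _ in targets]
    frames = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # 0 at the first frame, 1 at the last
        frames.append([
            (sx + alpha * (tx - sx), sy + alpha * (ty - sy))
            for (sx, sy), (tx, ty) in zip(starts, targets)
        ])
    return frames
```

Rendering each frame by compositing the object crops at these positions would yield the kind of synthetic sequence the abstract describes, with the final frame playing the role of the target composite.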
Problem

Research questions and friction points this paper is trying to address.

multi-object compositing
identity preservation
photorealistic synthesis
layout control
background fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

video diffusion
identity preservation
multi-object compositing
synthetic trajectories
temporal priors