Preserve Anything: Controllable Image Synthesis with Object Preservation

📅 2025-06-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-image (T2I) methods suffer from significant limitations in multi-object fidelity, prompt semantic consistency, and scene controllability. To address these challenges, we propose an N-channel ControlNet framework that jointly models object preservation, background semantic alignment, and explicit layout and illumination control. We introduce the first multimodal benchmark integrating 240K natural images and 18K 3D-synthetic images for comprehensive evaluation. Our method incorporates high-resolution background guidance, illumination-consistency constraints, and high-frequency detail superposition, augmented with dedicated modules for color fidelity, detail enhancement, and artifact suppression. Quantitative results demonstrate state-of-the-art performance, achieving an FID of 15.26 and a CLIP-Score of 32.85. User studies confirm substantial improvements over prior approaches: 25% in prompt alignment, 19% in photorealism, 13% in artifact suppression, and 14% in aesthetic quality.
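As a concrete illustration of the N-channel conditioning described above, here is a minimal sketch of how such a hint tensor could be assembled and encoded to latent resolution, in the style of ControlNet's conditioning encoder. The specific channel layout (object RGB, object mask, background guide, layout map, illumination map) and the HintEncoder module are assumptions for illustration; the paper's exact design may differ.

```python
# Sketch only: assembling an N-channel hint for a ControlNet-style model.
# The channel layout below (9 channels total) is an assumption, not the paper's spec.
import torch
import torch.nn as nn

def build_hint(object_rgb, object_mask, background_rgb, layout_map, illum_map):
    """Stack per-condition maps into one N-channel hint.

    object_rgb:     (B, 3, H, W) preserved-object pixels, zero outside the mask
    object_mask:    (B, 1, H, W) binary placement mask
    background_rgb: (B, 3, H, W) high-resolution background guide
    layout_map:     (B, 1, H, W) rasterized user layout
    illum_map:      (B, 1, H, W) target illumination / shading
    """
    return torch.cat(
        [object_rgb, object_mask, background_rgb, layout_map, illum_map], dim=1
    )  # (B, 9, H, W)

class HintEncoder(nn.Module):
    """Tiny stand-in for ControlNet's conditioning encoder: downsamples the
    N-channel hint to latent resolution before it joins the frozen U-Net's
    feature stream."""

    def __init__(self, in_ch=9, out_ch=320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, out_ch, 3, stride=2, padding=1),
        )

    def forward(self, hint):
        return self.net(hint)

if __name__ == "__main__":
    B, H, W = 1, 512, 512
    hint = build_hint(
        torch.rand(B, 3, H, W), torch.rand(B, 1, H, W).round(),
        torch.rand(B, 3, H, W), torch.rand(B, 1, H, W), torch.rand(B, 1, H, W),
    )
    print(HintEncoder()(hint).shape)  # torch.Size([1, 320, 64, 64])
```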

📝 Abstract
We introduce Preserve Anything, a novel method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail to (i) preserve multiple objects with fidelity, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the proposed method employs an N-channel ControlNet that integrates (i) object preservation with size and placement agnosticism, color and detail retention, and artifact elimination, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layouts and lighting conditions. Key components of our framework include object-preservation and background-guidance modules, a lighting-consistency constraint, and a high-frequency overlay module that retains fine details while mitigating unwanted artifacts. We introduce a benchmark dataset consisting of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. This dataset addresses the deficiencies of existing benchmarks and enables comprehensive evaluation. Empirical results demonstrate that our method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. A user study on an unseen benchmark shows improvements of approximately 25%, 19%, 13%, and 14% over existing works in prompt alignment, photorealism, suppression of AI artifacts, and natural aesthetics, respectively.
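The high-frequency overlay module described above suggests a classic low-/high-pass split. Below is one plausible reading, assuming a Gaussian blur as the low-pass filter: the reference object's high-frequency residual is added back onto the generated composite inside the object mask, preserving fine detail while leaving the generated low-frequency content (shadows, lighting) untouched. Function names and the kernel choice are hypothetical, not the paper's implementation.

```python
# Sketch only: high-frequency detail superposition via a Gaussian low-pass split.
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize=11, sigma=2.0):
    # Separable 1-D Gaussian, expanded to a normalized 2-D kernel.
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def blur(x, ksize=11, sigma=2.0):
    # Depthwise Gaussian blur: one kernel per channel (groups=C).
    c = x.shape[1]
    k = gaussian_kernel(ksize, sigma).to(x).expand(c, 1, ksize, ksize).contiguous()
    return F.conv2d(x, k, padding=ksize // 2, groups=c)

def hf_overlay(generated, reference, mask, ksize=11, sigma=2.0):
    """Superpose the reference object's high-frequency residual onto the
    generated image inside the mask; images assumed in [0, 1]."""
    high_freq = reference - blur(reference, ksize, sigma)
    return (generated + mask * high_freq).clamp(0.0, 1.0)
```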
Problem

Research questions and friction points this paper is trying to address.

Unreliable multi-object preservation in text-to-image generation
Weak semantic alignment with user prompts
Lack of explicit control over scene composition
Innovation

Methods, ideas, or system contributions that make the work stand out.

N-channel ControlNet for object preservation
High-resolution semantically consistent backgrounds
Explicit user control over background layouts and lighting (a layout-rasterization sketch follows this list)
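As referenced above, here is a minimal sketch of what explicit layout control could look like at the input level, assuming the layout condition is a binary map rasterized from user-specified bounding boxes (the paper does not describe the encoding at this level of detail). The result would feed the layout channel of the hint tensor sketched earlier.

```python
# Sketch only: rasterizing user boxes into a layout map (hypothetical encoding).
import torch

def rasterize_layout(boxes, H=512, W=512):
    """boxes: list of (x0, y0, x1, y1) pixel coords -> (1, 1, H, W) binary map."""
    layout = torch.zeros(1, 1, H, W)
    for x0, y0, x1, y1 in boxes:
        layout[..., y0:y1, x0:x1] = 1.0
    return layout

# Example: one object lower-left, another upper-right.
layout = rasterize_layout([(32, 300, 200, 480), (320, 40, 480, 200)])
```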
Prasen Kumar Sharma
Typeface Inc
Neeraj Matiyali
Typeface Inc
Siddharth Srivastava
Arizona State University
Gaurav Sharma
Typeface Inc