IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper targets joint control of object categories, quantities, and spatial layouts in multi-object image editing, and introduces a new task for it: quantity-and-layout consistent image editing (QL-Edit). The proposed framework, IMAGHarmony, uses harmony-aware attention (HA) to integrate multimodal semantics and to model object counts and layouts explicitly. In addition, a preference-guided noise selection (PNS) strategy picks semantically aligned initial noise samples through vision-language matching, which improves generation stability and layout consistency. On HarmonyBench, a benchmark the authors construct to cover diverse quantity and layout control scenarios, IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy.

📝 Abstract

Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at https://github.com/muzishen/IMAGHarmony.
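
The noise-selection idea in the abstract, choosing an initial noise sample whose output best matches the text, can be illustrated as a generic best-of-N seed search. The sketch below is not the authors' implementation, and the abstract does not specify their procedure or scoring model. `select_noise_seed`, `generate`, and `score` are hypothetical names: `generate` stands in for a diffusion sampler seeded with a given noise seed, and `score` stands in for a vision-language matching score. The toy functions exist only so the example runs without any model.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def select_noise_seed(
    seeds: Sequence[int],
    generate: Callable[[int], Sample],
    score: Callable[[Sample, str], float],
    prompt: str,
) -> Tuple[int, float]:
    """Return (seed, score) for the candidate seed with the highest score.

    Hypothetical helper: a plain best-of-N search over candidate seeds,
    assumed here as a stand-in for preference-guided noise selection.
    """
    if not seeds:
        raise ValueError("seeds must not be empty")
    best_seed, best_score = seeds[0], float("-inf")
    for seed in seeds:
        candidate_score = score(generate(seed), prompt)
        if candidate_score > best_score:
            best_seed, best_score = seed, candidate_score
    return best_seed, best_score


def toy_generate(seed: int) -> List[float]:
    """Toy stand-in for a sampler: deterministic pseudo-random values per seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(16)]


def toy_score(sample: List[float], prompt: str) -> float:
    """Toy stand-in for a matching score (assumption, not a real metric).

    Rewards samples whose count of positive entries is close to the
    number of words in the prompt.
    """
    positives = sum(1 for value in sample if value > 0)
    return -abs(positives - len(prompt.split()))


if __name__ == "__main__":
    seed, value = select_noise_seed(range(8), toy_generate, toy_score, "two red apples")
    print(f"selected seed {seed} with toy score {value}")
```

In a real pipeline the selected seed would be reused to initialize the noise for the final edit, so the cost of the search is N extra scoring passes.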

Problem

Research questions and friction points this paper is trying to address.

Control object quantity and layout in image editing

Improve multi-object editing accuracy and consistency

Enhance generation stability with aligned initial noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

Harmony-aware attention integrates multimodal semantics

Preference-guided noise selection enhances generation stability

Structure-aware framework ensures layout and quantity consistency

🔎 Similar Papers

An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control