Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-subject image generation methods often suffer from identity inconsistency and limited compositional control because they rely on diffusion models to implicitly associate text prompts with reference images. To address this, this work proposes a hierarchical Concept-to-Appearance Guidance (CAG) framework that structures supervision explicitly, jointly optimizing high-level semantic concepts and low-level appearance details. The approach incorporates VAE feature dropout during training to enhance semantic robustness, and introduces a correspondence-aware masked attention mechanism that enables precise attribute binding and multi-subject composition. By integrating a vision-language model, the VAE feature dropout strategy, the masked attention module, and a Diffusion Transformer architecture, the method significantly improves prompt adherence and subject identity consistency in multi-subject generation, achieving state-of-the-art performance.
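The VAE feature dropout described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's code: the function name, the per-subject feature layout, and the drop probability are all assumptions. The idea is simply to randomly withhold a subject's appearance features during training so the model must lean on the VLM's semantic signal.

```python
import torch

def vae_feature_dropout(ref_vae_feats, p_drop=0.3, training=True):
    """Randomly omit per-subject reference VAE features (illustrative sketch).

    ref_vae_feats: list of [tokens, dim] tensors, one per reference subject.
    With probability p_drop, a subject's appearance features are zeroed out,
    encouraging reliance on robust VLM-derived semantic cues instead.
    """
    if not training:
        return ref_vae_feats  # inference keeps all appearance features
    kept = []
    for feats in ref_vae_feats:
        if torch.rand(()) < p_drop:
            kept.append(torch.zeros_like(feats))  # drop this subject's appearance cue
        else:
            kept.append(feats)
    return kept
```

At inference time all reference features are kept; the dropout only shapes training behavior, analogous to classifier-free-guidance-style condition dropping.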

📝 Abstract
Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts each text token to attend only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multi-subject image generation, substantially improving prompt following and subject consistency.
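The correspondence-aware masking in the abstract amounts to building a text-to-reference attention mask from VLM-derived subject assignments. The sketch below is an assumption-laden illustration (the function name, the subject-id encoding, and the use of -1 for global tokens are inventions for clarity), showing how such a mask could restrict each text token to its matched reference regions inside DiT attention.

```python
import torch

def correspondence_mask(text_to_subject, ref_token_subject):
    """Build a [T, R] boolean attention mask (illustrative sketch).

    text_to_subject: [T] long tensor, VLM-derived subject id per text token
        (-1 marks a global token allowed to attend everywhere).
    ref_token_subject: [R] long tensor, subject id per reference image token.
    Returns a mask where True means attention is permitted.
    """
    # A text token may attend to a reference token only if their subject ids match.
    mask = text_to_subject.unsqueeze(1).eq(ref_token_subject.unsqueeze(0))
    # Global text tokens (id -1) remain unrestricted.
    mask |= text_to_subject.unsqueeze(1).eq(-1)
    return mask
```

In practice the mask would be applied before softmax, e.g. `scores.masked_fill(~mask, float("-inf"))`, so each text token's attribute description binds only to its own subject's reference tokens.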
Problem

Research questions and friction points this paper is trying to address.

multi-subject image generation
identity consistency
compositional control
text-to-image synthesis
subject fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Guidance
Concept-to-Appearance
VAE Dropout
Correspondence-aware Attention
Multi-subject Image Generation