Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing subject-driven text-to-image generation methods struggle to preserve high-frequency identity details, such as text, logos, and patterns. They typically operate directly in RGB space, which often leads to detail degradation under substantial edits. To address this, this work proposes a two-stage framework that decouples structure from appearance: it first predicts a Canny edge map, then renders the final image conditioned on both the source image's appearance and the predicted structure. The authors also introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines in preserving subject-specific details, suggesting that intermediate structural prediction is an effective route to high-fidelity subject-driven generation.
📝 Abstract
Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.
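The two-stage control flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `generate`, `predict_structure_toy`, and `render_toy` are hypothetical, both stages are toy stand-ins for learned models, and `sobel_edge_map` is a simplified gradient-threshold detector rather than a full Canny detector (it omits Gaussian smoothing, non-maximum suppression, and hysteresis).

```python
from typing import Callable, List

Image = List[List[float]]    # grayscale image, row-major, values in [0, 1]
EdgeMap = List[List[int]]    # binary structure map, 1 = edge pixel


def sobel_edge_map(img: Image, threshold: float = 0.5) -> EdgeMap:
    """Binary edge map from Sobel gradient magnitude.

    Simplified stand-in for a Canny map: no smoothing, no non-maximum
    suppression, no hysteresis. Border pixels are left as 0.
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def generate(
    source: Image,
    prompt: str,
    predict_structure: Callable[[Image, str], EdgeMap],
    render: Callable[[Image, EdgeMap, str], Image],
) -> Image:
    """Two-stage flow: predict structure first, then render appearance onto it."""
    structure = predict_structure(source, prompt)   # stage 1: edge map of the target
    return render(source, structure, prompt)        # stage 2: source appearance + structure


def predict_structure_toy(source: Image, prompt: str) -> EdgeMap:
    # ASSUMPTION: the "edit" is a horizontal flip; a real stage 1 would be a
    # learned model that predicts the edge map of the edited image.
    flipped = [row[::-1] for row in source]
    return sobel_edge_map(flipped)


def render_toy(source: Image, structure: EdgeMap, prompt: str) -> Image:
    # ASSUMPTION: "appearance" is reduced to the mean intensity of the source,
    # painted onto edge pixels; a real stage 2 would be a generative model.
    h, w = len(source), len(source[0])
    mean = sum(sum(row) for row in source) / (h * w)
    return [[mean if e else 0.0 for e in row] for row in structure]


if __name__ == "__main__":
    img = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
    for row in generate(img, "flip the subject", predict_structure_toy, render_toy):
        print(row)
```

The point of the sketch is the separation of concerns: stage 1 only has to decide where the structure goes, and stage 2 only has to transfer appearance onto that structure, which is the decoupling the abstract argues helps preserve fine details.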
Problem

Research questions and friction points this paper is trying to address.

subject-driven generation
high-frequency details
text-to-image
identity preservation
detail degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

intermediate structural prediction
subject-driven generation
Canny map conditioning
text-aware dataset
high-fidelity image synthesis