TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Subject-driven image generation (SDIG) faces the challenge of simultaneously preserving subject identity and faithfully following editing instructions. This paper proposes TIDE, a target-instructed diffusion enhancement framework that achieves high-quality subject-driven generation without test-time fine-tuning. Its core contributions are: (1) a target-supervised triplet alignment mechanism that uses reference images, text instructions, and target images as joint supervision signals; and (2) a Direct Subject Diffusion (DSD) objective coupled with an implicit reward model trained on paired winning/losing samples, enabling a dynamic balance between subject fidelity and instruction adherence via contrastive learning and preference optimization. Evaluations on standard benchmarks show that TIDE outperforms existing methods in subject consistency and instruction following, and extends to structural-conditioned generation, image-to-image generation, and text-image interpolation.
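The win/loss pairing described in the summary can be pictured with a small scoring sketch: candidate targets are ranked by a weighted mix of subject fidelity (similarity to the reference image) and instruction adherence (similarity to the instruction text), and the best and worst candidates form a preference pair. The embedding inputs, metric choices, and the `w_subject` weight are illustrative assumptions, not the paper's exact generation-and-evaluation pipeline.

```python
import torch
import torch.nn.functional as F

def score_candidate(ref_emb, text_emb, cand_emb, w_subject=0.5):
    """Score one generated candidate by balancing subject fidelity (similarity to the
    reference-image embedding) against instruction adherence (similarity to the
    instruction-text embedding). Metrics and weighting are illustrative assumptions."""
    subject_fidelity = F.cosine_similarity(cand_emb, ref_emb, dim=-1)
    instruction_adherence = F.cosine_similarity(cand_emb, text_emb, dim=-1)
    return w_subject * subject_fidelity + (1.0 - w_subject) * instruction_adherence

def build_preference_pair(ref_emb, text_emb, candidate_embs):
    """Pick the best- and worst-scoring candidates as the 'winning' and 'losing'
    targets attached to one (reference, instruction) training sample."""
    scores = torch.stack([score_candidate(ref_emb, text_emb, c) for c in candidate_embs])
    win_idx, lose_idx = scores.argmax().item(), scores.argmin().item()
    return win_idx, lose_idx
```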

📝 Abstract
Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired "winning" (balanced preservation-compliance) and "losing" (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE's superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE's versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.
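The "winning vs. losing target" training signal described in the abstract resembles preference optimization for diffusion models (in the style of Diffusion-DPO). Below is a minimal sketch of such a pairwise objective, assuming a noise-prediction model, a frozen reference copy of it, and a temperature `beta`; TIDE's actual DSD objective may differ in its exact form.

```python
import torch
import torch.nn.functional as F

def dsd_style_preference_loss(model, ref_model, x_win_t, x_lose_t,
                              noise_win, noise_lose, t, cond, beta=0.1):
    """Pairwise preference loss over a winning/losing target pair (Diffusion-DPO style).
    Each term is the trainable model's denoising MSE minus that of a frozen reference
    model; reducing error on the winner relative to the loser is rewarded. A sketch
    under assumed inputs, not TIDE's exact objective.
    x_*_t: noised latents (B, C, H, W) of the winning/losing targets at timestep t
    noise_*: ground-truth noise added to each target
    cond: conditioning (e.g., reference-image and instruction embeddings)."""
    err_w = F.mse_loss(model(x_win_t, t, cond), noise_win, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(model(x_lose_t, t, cond), noise_lose, reduction="none").mean(dim=(1, 2, 3))
    with torch.no_grad():
        ref_err_w = F.mse_loss(ref_model(x_win_t, t, cond), noise_win, reduction="none").mean(dim=(1, 2, 3))
        ref_err_l = F.mse_loss(ref_model(x_lose_t, t, cond), noise_lose, reduction="none").mean(dim=(1, 2, 3))
    # Implicit reward: improvement over the reference model on the winner vs. the loser.
    logits = (ref_err_w - err_w) - (ref_err_l - err_l)
    return -F.logsigmoid(beta * logits).mean()
```

Training this objective on systematically generated win/loss pairs is what would let the model act as its own implicit reward model, without a separate reward network or test-time fine-tuning.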
Problem

Research questions and friction points this paper is trying to address.

Balancing subject identity preservation with instruction compliance
Resolving tension in subject-driven image generation without fine-tuning
Achieving optimal preservation-compliance balance through target supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target-supervised triplet alignment for subject adaptation
Direct Subject Diffusion (DSD) objective with paired winning/losing targets
Implicit reward modeling without test-time fine-tuning
Authors
Jibai Lin
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Bo Ma
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Yating Yang
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Rong Ma
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Turghun Osman
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Ahtamjan Ahmat
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Rui Dong
Ph.D. candidate, University of Michigan
program synthesis · formal methods · program verification
Lei Wang
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Xi Zhou
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing