TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Subject-driven image generation (SDIG) faces the challenge of simultaneously preserving subject identity and faithfully following editing instructions. This paper proposes TIDE, a target-instructed diffusion enhancement framework that achieves high-quality subject-driven generation without test-time fine-tuning. Its core contributions are: (1) a target-supervised triplet alignment mechanism that uses reference images, text instructions, and target images as joint supervision signals; and (2) a Direct Subject Diffusion (DSD) objective coupled with an implicit reward model trained on paired winning/losing samples, enabling a dynamic balance between subject fidelity and instruction adherence via contrastive learning and preference optimization. Evaluations on standard benchmarks show that TIDE outperforms existing methods in subject consistency and instruction following, and extends to structural-conditioned generation, image-to-image generation, and text-image interpolation.
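The win/loss pairing described in the summary can be pictured with a small scoring sketch: candidate targets are ranked by a weighted mix of subject fidelity (similarity to the reference image) and instruction adherence (similarity to the instruction text), and the best and worst candidates form a preference pair. The embedding inputs, metric choices, and the `w_subject` weight are illustrative assumptions, not the paper's exact generation-and-evaluation pipeline.

```python
import torch
import torch.nn.functional as F

def score_candidate(ref_emb, text_emb, cand_emb, w_subject=0.5):
    """Score one generated candidate by balancing subject fidelity (similarity to the
    reference-image embedding) against instruction adherence (similarity to the
    instruction-text embedding). Metrics and weighting are illustrative assumptions."""
    subject_fidelity = F.cosine_similarity(cand_emb, ref_emb, dim=-1)
    instruction_adherence = F.cosine_similarity(cand_emb, text_emb, dim=-1)
    return w_subject * subject_fidelity + (1.0 - w_subject) * instruction_adherence

def build_preference_pair(ref_emb, text_emb, candidate_embs):
    """Pick the best- and worst-scoring candidates as the 'winning' and 'losing'
    targets attached to one (reference, instruction) training sample."""
    scores = torch.stack([score_candidate(ref_emb, text_emb, c) for c in candidate_embs])
    win_idx, lose_idx = scores.argmax().item(), scores.argmin().item()
    return win_idx, lose_idx
```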

📝 Abstract
Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired "winning" (balanced preservation-compliance) and "losing" (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE's superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE's versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.
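The "winning vs. losing target" training signal described in the abstract resembles preference optimization for diffusion models (in the style of Diffusion-DPO). Below is a minimal sketch of such a pairwise objective, assuming a noise-prediction model, a frozen reference copy of it, and a temperature `beta`; TIDE's actual DSD objective may differ in its exact form.

```python
import torch
import torch.nn.functional as F

def dsd_style_preference_loss(model, ref_model, x_win_t, x_lose_t,
                              noise_win, noise_lose, t, cond, beta=0.1):
    """Pairwise preference loss over a winning/losing target pair (Diffusion-DPO style).
    Each term is the trainable model's denoising MSE minus that of a frozen reference
    model; reducing error on the winner relative to the loser is rewarded. A sketch
    under assumed inputs, not TIDE's exact objective.
    x_*_t: noised latents (B, C, H, W) of the winning/losing targets at timestep t
    noise_*: ground-truth noise added to each target
    cond: conditioning (e.g., reference-image and instruction embeddings)."""
    err_w = F.mse_loss(model(x_win_t, t, cond), noise_win, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(model(x_lose_t, t, cond), noise_lose, reduction="none").mean(dim=(1, 2, 3))
    with torch.no_grad():
        ref_err_w = F.mse_loss(ref_model(x_win_t, t, cond), noise_win, reduction="none").mean(dim=(1, 2, 3))
        ref_err_l = F.mse_loss(ref_model(x_lose_t, t, cond), noise_lose, reduction="none").mean(dim=(1, 2, 3))
    # Implicit reward: improvement over the reference model on the winner vs. the loser.
    logits = (ref_err_w - err_w) - (ref_err_l - err_l)
    return -F.logsigmoid(beta * logits).mean()
```

Training this objective on systematically generated win/loss pairs is what would let the model act as its own implicit reward model, without a separate reward network or test-time fine-tuning.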
Problem

Research questions and friction points this paper is trying to address.

Balancing subject identity preservation with instruction compliance
Resolving tension in subject-driven image generation without fine-tuning
Achieving optimal preservation-compliance balance through target supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target-supervised triplet alignment for subject adaptation
Direct Subject Diffusion (DSD) objective with paired winning/losing targets
Implicit reward modeling without test-time fine-tuning
Authors
Jibai Lin
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Bo Ma
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Yating Yang
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Rong Ma
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Turghun Osman
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Ahtamjan Ahmat
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Rui Dong
Ph.D. candidate, University of Michigan
program synthesis · formal methods · program verification
Lei Wang
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing
Xi Zhou
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Laboratory of Minority Speech and Language Information Processing