Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation

📅 2025-06-09

📈 Citations: 0

✨ Influential: 0

career value

172K/year

🤖 AI Summary

This work addresses the image analogical generation problem A:A′::B:B′—generating a semantically consistent B′ given three input images (A, A′, B)—using off-the-shelf diffusion models (e.g., Stable Diffusion, SDXL) without relying on task-specific editing architectures (e.g., InstructPix2Pix or inpainting models). Methodologically, we introduce a Delta Interpolation mechanism to extract cross-image semantic differences, coupled with a token-level self-supervised consistency loss and zero-initialized text embedding initialization, enabling transferable and plug-and-play modeling of semantic transformations. To our knowledge, this is the first general-purpose analogical generation framework compatible with arbitrary diffusion models. Extensive experiments demonstrate significant improvements over prior methods across multiple benchmarks, with a 12.3% gain in quantitative metrics, superior semantic fidelity, and robust generalization across diverse diffusion backbones.

Technology Category

Application Category

📝 Abstract

How can we generate an image B' that satisfies A:A'::B:B', given the input images A,A' and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g. InstructPix2Pix. Inpainting models) rather than general diffusion models (e.g. Stable Diffusion, SDXL). This dependency may lead to inherited biases or lower editing capabilities. In this paper, we propose Difference Inversion, a method that isolates only the difference from A and A' and applies it to B to generate a plausible B'. To address model dependency, it is crucial to structure prompts in the form of a"Full Prompt"suitable for input to stable diffusion models, rather than using an"Instruction Prompt". To this end, we accurately extract the Difference between A and A' and combine it with the prompt of B, enabling a plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose the 2) Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate more feasible B' in a model-agnostic manner.

Problem

Research questions and friction points this paper is trying to address.

Generate image B' satisfying A:A'::B:B' using general diffusion models

Isolate and apply differences between images A and A' to B

Ensure model-agnostic and precise difference extraction for image analogy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Difference Inversion isolates and applies image differences

Uses Delta Interpolation for precise difference extraction

Implements Token Consistency Loss for accurate training

🔎 Similar Papers

Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models