Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

📅 2024-03-24

🏛️ AAAI Conference on Artificial Intelligence

📈 Citations: 6

✨ Influential: 1

career value

193K/year

🤖 AI Summary

This work addresses the lack of a unified, controllable framework for text-guided image-to-image (I2I) translation. We propose FCDiffusion—the first end-to-end diffusion model based on frequency-domain modulation. It applies the discrete cosine transform (DCT) in the latent space to decouple features into low-, mid-, and high-frequency components, enabling fine-grained, text-conditioned control via a learnable frequency filtering module. By simply switching frequency-control branches, FCDiffusion seamlessly supports diverse tasks—including style creation, semantic editing, scene transfer, and style translation—without task-specific architectures. Integrated with latent diffusion models (LDMs) and text cross-attention, it achieves state-of-the-art performance across multiple benchmarks, demonstrating superior generation quality, precise controllability, and strong cross-task generalization. Code and pretrained models are publicly available.

Technology Category

Application Category

📝 Abstract

Recently, text-to-image diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing flexible image translation via user-provided text prompts. This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework contributing a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which extracts image features carrying different DCT spectral bands to control the text-to-image generation process of the Latent Diffusion Model, realizing versatile I2I applications including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related methods, FCDiffusion establishes a unified text-driven I2I framework suiting diverse I2I application scenarios simply by switching among different frequency control branches. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/.

Problem

Research questions and friction points this paper is trying to address.

Proposes frequency-controlled diffusion model for text-guided image translation

Uses DCT spectral bands to control style, structure, and layout

Enables versatile image translation tasks via frequency-domain filtering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-domain filtering via Discrete Cosine Transform

Multi-band DCT control signals for diverse correlations

Unified framework for versatile image translation tasks

🔎 Similar Papers

FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation