Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

📅 2024-03-24
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 6
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the lack of a unified, controllable framework for text-guided image-to-image (I2I) translation. We propose FCDiffusion—the first end-to-end diffusion model based on frequency-domain modulation. It applies the discrete cosine transform (DCT) in the latent space to decouple features into low-, mid-, and high-frequency components, enabling fine-grained, text-conditioned control via a learnable frequency filtering module. By simply switching frequency-control branches, FCDiffusion seamlessly supports diverse tasks—including style creation, semantic editing, scene transfer, and style translation—without task-specific architectures. Integrated with latent diffusion models (LDMs) and text cross-attention, it achieves state-of-the-art performance across multiple benchmarks, demonstrating superior generation quality, precise controllability, and strong cross-task generalization. Code and pretrained models are publicly available.

Technology Category

Application Category

📝 Abstract
Recently, text-to-image diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing flexible image translation via user-provided text prompts. This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework contributing a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which extracts image features carrying different DCT spectral bands to control the text-to-image generation process of the Latent Diffusion Model, realizing versatile I2I applications including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related methods, FCDiffusion establishes a unified text-driven I2I framework suiting diverse I2I application scenarios simply by switching among different frequency control branches. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. Our project is publicly available at: https://xianggao1102.github.io/FCDiffusion/.
Problem

Research questions and friction points this paper is trying to address.

Proposes frequency-controlled diffusion model for text-guided image translation
Uses DCT spectral bands to control style, structure, and layout
Enables versatile image translation tasks via frequency-domain filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-domain filtering via Discrete Cosine Transform
Multi-band DCT control signals for diverse correlations
Unified framework for versatile image translation tasks
🔎 Similar Papers
No similar papers found.
X
Xiang Gao
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Z
Zhengbo Xu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
J
Junhan Zhao
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Jiaying Liu
Jiaying Liu
Dalian University of Technology
Graph LearningData ScienceComputational Social Science