Dynamic Frequency Modulation for Controllable Text-driven Image Generation

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a challenge in text-guided image generation: modifying a prompt often induces unintended global structural changes, and existing methods, which rely on heuristic feature map selection, offer limited stability. The study analyzes how the frequency content of the noisy latent evolves during diffusion, with low-frequency components establishing structure in the early stage and high-frequency components later synthesizing fine-grained texture, and proposes a training-free dynamic frequency modulation approach. By directly manipulating the noisy latent variables in the frequency domain with a frequency-dependent weighting function that decays dynamically, the method enables targeted semantic edits while preserving structural consistency. The authors report that extensive experiments show the approach significantly outperforming state-of-the-art methods, balancing structure preservation with semantic updates.

📝 Abstract
The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical selection of feature maps for intervention; their performance depends heavily on choosing appropriately, which leads to suboptimal stability. This paper approaches the problem from a frequency perspective, analyzing how the frequency spectrum of the noisy latent variables affects the hierarchical emergence of the structural framework and fine-grained textures during generation. We find that lower-frequency components are primarily responsible for establishing the structural framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method that uses a frequency-dependent weighting function with dynamic decay. This method maintains the consistency of the structural framework while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.
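
The abstract does not give the exact weighting function, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a Gaussian low-pass mask scaled by an exponential decay in the normalized step index, and it assumes the modulation blends the low frequencies of a reference latent (from the original prompt) into the latent being generated for the edited prompt. The names `lowpass_weight`, `modulate_latent`, `cutoff`, and `decay` are hypothetical.

```python
import numpy as np


def radial_frequency(h, w):
    """Radial spatial frequency (cycles per pixel) on the FFT grid; 0 at DC."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)


def lowpass_weight(h, w, t_frac, cutoff=0.1, decay=5.0):
    """Hypothetical frequency-dependent weight with dynamic decay.

    t_frac is the fraction of generation completed (0 = first step, 1 = last).
    The Gaussian mask favors low frequencies; the exponential factor shrinks
    the weight as generation proceeds. Both forms are assumptions.
    """
    mask = np.exp(-(radial_frequency(h, w) / cutoff) ** 2)
    strength = np.exp(-decay * t_frac)
    return strength * mask


def modulate_latent(z_edit, z_ref, t_frac, cutoff=0.1, decay=5.0):
    """Blend low frequencies of z_ref into z_edit; both have shape (C, H, W)."""
    h, w = z_edit.shape[-2:]
    wgt = lowpass_weight(h, w, t_frac, cutoff, decay)
    f_edit = np.fft.fft2(z_edit, axes=(-2, -1))
    f_ref = np.fft.fft2(z_ref, axes=(-2, -1))
    f_mix = wgt * f_ref + (1.0 - wgt) * f_edit
    # The weight is real and symmetric in frequency, so the result is real
    # up to floating-point error.
    return np.fft.ifft2(f_mix, axes=(-2, -1)).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_ref = rng.standard_normal((4, 64, 64))   # latent from the original prompt
    z_edit = rng.standard_normal((4, 64, 64))  # latent from the edited prompt
    for t_frac in (0.0, 0.5, 1.0):
        z = modulate_latent(z_edit, z_ref, t_frac)
        print(t_frac, float(np.abs(z - z_edit).mean()))
```

Under these assumptions, early steps pull the coarse layout toward the reference latent, while late steps leave the edited latent almost untouched, so texture-level changes come from the edited prompt.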

Problem

Research questions and friction points this paper is trying to address.

text-driven image generation
structure preservation
semantic modification
diffusion models
frequency spectrum

Innovation

Methods, ideas, or system contributions that make the work stand out.

frequency modulation
text-to-image generation
diffusion models
structure preservation
training-free control
Tiandong Shi
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
Ling Zhao
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
Ji Qi
School of Geography and Remote Sensing, Guangzhou University, Guangzhou 510006, China
Jiayi Ma
Wuhan University
Computer Vision, Image Fusion, Image Matching
Chengli Peng
Wuhan University
Semantic segmentation