Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-guided image and video color editing requires fine-grained control over color attributes (e.g., albedo, light-source color, ambient illumination) while preserving geometric structure and physical consistency; existing training-free methods struggle to balance this precision with regional coherence. The paper proposes ColorCtrl, a training-free, general-purpose editing framework built on a Multi-Modal Diffusion Transformer (MM-DiT). A decoupled attention mechanism disentangles structural and chromatic representations, enabling word-level modulation of attribute strength and precise, spatially localized edits driven by text prompts. Evaluated on SD3 and FLUX.1-dev, ColorCtrl achieves state-of-the-art edit quality and consistency among training-free methods, and surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o in consistency. Extended to video models, it shows superior temporal coherence and cross-frame physical consistency, maintaining realistic lighting and material properties throughout a sequence.

📝 Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
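The sketch below illustrates one plausible reading of the abstract's attention manipulation, not the paper's exact method: run a source (reconstruction) pass and an edit pass in parallel, keep the source pass's attention map in selected MM-DiT layers so geometry is preserved, and read value tokens from the edit pass so color follows the new prompt. The function name, tensor shapes, and the `blend` interpolation are illustrative assumptions.

```python
import torch

def decoupled_attention(q_src, k_src, v_src, v_edit, blend=1.0):
    """Attention step that keeps the SOURCE pass's attention map
    (structure) while reading VALUE tokens from the EDIT pass (color).

    q_src, k_src, v_src: (batch, heads, tokens, dim) from the source-prompt pass.
    v_edit:              (batch, heads, tokens, dim) from the edit-prompt pass.
    blend: 0 -> pure reconstruction, 1 -> full color edit.
    (Hypothetical linear blend; the paper's exact rule may differ.)
    """
    scale = q_src.shape[-1] ** -0.5
    attn = torch.softmax(q_src @ k_src.transpose(-2, -1) * scale, dim=-1)
    v = (1.0 - blend) * v_src + blend * v_edit
    return attn @ v

# Toy shapes: 2 heads, 8 joint text+image tokens, 16-dim heads.
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v_src = torch.randn(1, 2, 8, 16)
v_edit = torch.randn(1, 2, 8, 16)
out = decoupled_attention(q, k, v_src, v_edit, blend=0.7)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```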
Problem

Research questions and friction points this paper is trying to address.

Precise color editing in images and videos with text guidance
Maintaining physical consistency in geometry and lighting during edits
Achieving word-level control of color attribute intensity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Multi-Modal Diffusion Transformer (MM-DiT) attention
Disentangles structure and color via attention maps and value tokens
Enables word-level control of color attribute intensity (see the sketch after this list)
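For the word-level intensity control mentioned above, the abstract does not spell out the exact rule; a plausible stand-in, in the spirit of prompt-to-prompt attention reweighting, is post-softmax rescaling of the attention mass on the target word's text tokens followed by row renormalization. All names and shapes below are hypothetical.

```python
import torch

def reweight_word_attention(attn, word_token_ids, strength):
    """Scale the attention that queries place on the text tokens of one
    attribute word, then renormalize each row to sum to 1.

    attn: post-softmax attention map, (batch, heads, queries, keys).
    word_token_ids: key indices of the target word, e.g. "red".
    strength: >1 amplifies the attribute, <1 attenuates it.
    """
    attn = attn.clone()
    attn[..., word_token_ids] = attn[..., word_token_ids] * strength
    return attn / attn.sum(dim=-1, keepdim=True)

# Toy example: 8 queries over 8 keys; key token 3 stands in for "red".
attn = torch.softmax(torch.randn(1, 2, 8, 8), dim=-1)
stronger = reweight_word_attention(attn, [3], strength=1.5)
print(stronger.shape, float(stronger.sum(-1).mean()))  # (1, 2, 8, 8) 1.0
```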