ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities

📅 2025-04-09
🤖 AI Summary
Existing reference-based sketch colorization methods rely on semantically and spatially aligned triplets (sketch, reference image, ground truth) for training. Real-world reference images, however, are often spatially and semantically misaligned with the sketch, and the resulting distribution shift leads to overfitting and spatial artifacts. To address this, we propose a Dynamic Decoupling Carrier mechanism built around a Split Cross-Attention module, combining spatial mask guidance, separate foreground and background encoding, background bleaching preprocessing, and dual-path feature transfer in latent space. Integrated into a diffusion framework, our method enables region-adaptive injection of reference information and closer detail alignment. Quantitative evaluations across multiple metrics and a user study show improvements over state-of-the-art methods: fewer misalignment artifacts, better color accuracy and detail fidelity, and more natural foreground-background integration.

📝 Abstract
Reference-based sketch colorization methods have garnered significant attention due to their potential applications in the animation production industry. However, most existing methods are trained with image triplets of sketch, reference, and ground truth that are semantically and spatially well-aligned, while real-world references and sketches often exhibit substantial misalignment. This mismatch in data distribution between training and inference leads to overfitting, consequently resulting in spatial artifacts and significant degradation in overall colorization quality, limiting potential applications of current methods for general purposes. To address this limitation, we conduct an in-depth analysis of the carrier, defined as the latent representation facilitating information transfer from reference to sketch. Based on this analysis, we propose a novel workflow that dynamically adapts the carrier to optimize distinct aspects of colorization. Specifically, for spatially misaligned artifacts, we introduce a split cross-attention mechanism with spatial masks, enabling region-specific reference injection within the diffusion process. To mitigate semantic neglect of sketches, we employ dedicated background and style encoders to transfer detailed reference information in the latent feature space, achieving enhanced spatial control and richer detail synthesis. Furthermore, we propose character-mask merging and background bleaching as preprocessing steps to improve foreground-background integration and background generation. Extensive qualitative and quantitative evaluations, including a user study, demonstrate the superior performance of our proposed method compared to existing approaches. An ablation study further validates the efficacy of each proposed component.
Problem

Research questions and friction points this paper is trying to address.

Addresses the mismatch between well-aligned training triplets and misaligned real-world reference-sketch pairs
Reduces spatial artifacts and colorization quality degradation
Enhances spatial control and detail synthesis in colorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

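Dynamically adapts the carrier (the latent representation that transfers information from reference to sketch) to optimize distinct aspects of colorization
Split cross-attention with spatial masks injects reference information per region, reducing spatial artifacts
Dedicated background and style encoders transfer reference detail in latent feature space, improving spatial control and detail synthesis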
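
The split cross-attention idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, and the names `split_cross_attention`, `fg_tokens`, `bg_tokens`, and `fg_mask` are hypothetical. It only shows how a spatial mask over sketch locations might route each location to a foreground or a background reference path.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    """Single-head attention: queries (N, d) attend to reference tokens (M, d).

    Assumption: keys and values are the same tokens, with no learned projections.
    """
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

def split_cross_attention(q, fg_tokens, bg_tokens, fg_mask):
    """Blend two attention paths with a per-query spatial mask (hypothetical).

    fg_mask has shape (N,) with values in [0, 1]; 1 marks a foreground location.
    """
    fg_out = cross_attention(q, fg_tokens)  # foreground reference path
    bg_out = cross_attention(q, bg_tokens)  # background reference path
    m = fg_mask[:, None]
    return m * fg_out + (1.0 - m) * bg_out

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))          # 6 sketch locations, feature dim 8
fg_tokens = rng.normal(size=(4, 8))  # foreground reference tokens
bg_tokens = rng.normal(size=(5, 8))  # background reference tokens
fg_mask = np.array([1, 1, 1, 0, 0, 0], dtype=float)

out = split_cross_attention(q, fg_tokens, bg_tokens, fg_mask)
print(out.shape)
```

With a binary mask, each sketch location receives information from exactly one reference region, which is one way to read "region-specific reference injection"; a soft mask would blend the two paths near region boundaries.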
Dingkun Yan
Institute of Science Tokyo, School of Computing, Japan
Xinrui Wang
University of Tokyo, Japan
Yusuke Iwasawa
The University of Tokyo
deep learning, transfer learning, foundation model, meta learning
Yutaka Matsuo
University of Tokyo, Japan
Suguru Saito
Institute of Science Tokyo, School of Computing, Japan
Jiaxian Guo
Google Research
Efficient Foundation Model, Reinforcement Learning, Causality