🤖 AI Summary
This work addresses the distribution shift in sketch colorization caused by semantic misalignment between training and test data, which manifests as artifacts, low resolution, and poor controllability. To mitigate this, the authors propose a dual-branch framework that explicitly models the distinct data distributions seen during training and inference, coupled with a Gram regularization loss that enforces cross-domain distributional consistency. An anime-specific Tagger network extracts fine-grained attributes from reference images to modulate the SDXL conditional encoder, while a dedicated texture-enhancement plugin module enables high-resolution, disentangled, and controllable reference-guided colorization. This approach is the first to directly minimize the train-inference distribution gap, achieving state-of-the-art performance in both visual quality and controllability. User studies and ablation experiments confirm its effectiveness in enhancing detail fidelity and colorization accuracy.
📝 Abstract
Sketch colorization is a critical task for automating and assisting the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, but focused on mitigating the artifacts caused by the distribution shift rather than fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training and inference processes with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributes from reference images and modulate SDXL's conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing state-of-the-art performance across both quality and controllability metrics. An ablation study reveals the contribution of each component.
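The abstract does not spell out the exact form of the Gram Regularization Loss, but Gram-matrix matching between feature maps is a standard construction (as in neural style transfer). The sketch below is a minimal, hypothetical NumPy illustration of matching Gram statistics between paired feature maps from a semantic-aligned branch and a semantic-misaligned branch; the function names and normalization are assumptions, not the paper's implementation.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by its size.

    Captures channel-wise correlations (second-order statistics),
    which characterize the feature distribution independent of
    spatial layout.
    """
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)  # (C, C)

def gram_regularization_loss(feats_aligned, feats_misaligned):
    """Hypothetical cross-branch loss: mean squared difference between
    Gram matrices of corresponding feature maps from the two branches,
    averaged over layers."""
    losses = [
        np.mean((gram_matrix(fa) - gram_matrix(fm)) ** 2)
        for fa, fm in zip(feats_aligned, feats_misaligned)
    ]
    return float(np.mean(losses))
```

Because the Gram matrix discards spatial arrangement, penalizing the difference between the two branches' Gram matrices pushes them toward the same feature statistics even when the branches see semantically aligned versus misaligned inputs.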