🤖 AI Summary
Existing video colorization methods suffer from temporal flickering or require extensive manual intervention, making it difficult to achieve both high fidelity and temporal consistency. This paper proposes a language-conditioned diffusion model framework for automatic video colorization: semantic guidance from generic text prompts and automatically generated segmentation masks drives initial color generation, while inter-frame color propagation via RAFT optical flow, augmented by an inconsistency correction mechanism, suppresses misalignment and flickering. To the authors' knowledge, this is the first work to apply language-guided diffusion models to video colorization without manual color specification. The method achieves state-of-the-art performance on the DAVIS30 and VIDEVO20 benchmarks, outperforming prior approaches on PSNR, Colorfulness, and CDC, demonstrating significant improvements in color accuracy and visual coherence.
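To make the automatic guidance step concrete, here is a minimal sketch of how generic-prompt-plus-segmentation guidance could be assembled: an off-the-shelf instance segmentation model (torchvision's Mask R-CNN here, an assumption; the paper's segmenter may differ) supplies object masks and class names, which are folded into a generic text prompt for the diffusion colorizer. The prompt template, score threshold, and how the masks ultimately condition the diffusion model are illustrative assumptions, not the paper's exact design.

```python
import torch
from torchvision.models.detection import (
    MaskRCNN_ResNet50_FPN_V2_Weights,
    maskrcnn_resnet50_fpn_v2,
)

# Off-the-shelf instance segmentation (an assumed stand-in for the
# paper's automatic mask generator).
weights = MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT
segmenter = maskrcnn_resnet50_fpn_v2(weights=weights).eval()
categories = weights.meta["categories"]  # COCO class names

# Hypothetical generic prompt; the paper's exact wording is not given here.
GENERIC_PROMPT = "a colorful, high quality, natural photograph"


@torch.no_grad()
def build_guidance(frame, score_thresh=0.7):
    """Produce (prompt, masks) for one frame.

    `frame` is a (3, H, W) float tensor in [0, 1]; for grayscale input,
    replicate the single luminance channel three times.
    """
    out = segmenter([frame])[0]
    keep = out["scores"] > score_thresh
    masks = out["masks"][keep]  # soft instance masks, (N, 1, H, W)
    labels = {categories[i] for i in out["labels"][keep].tolist()}
    prompt = GENERIC_PROMPT
    if labels:
        # Fold detected class names into the otherwise generic prompt.
        prompt += " of " + ", ".join(sorted(labels))
    return prompt, masks
```

Note that in the paper's primary automatic setting a generic prompt alone already reaches state of the art, so the class-name refinement above is optional.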
📝 Abstract
Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach that automates high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt and achieves state-of-the-art results without any specific color input. Temporal stability is achieved by warping color information from previous frames using RAFT optical flow; a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show that our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
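As a concrete illustration of the propagation step, the sketch below (assuming PyTorch and torchvision's pretrained RAFT) backward-warps the previous frame's chrominance into the current frame and uses a forward-backward flow consistency check as a stand-in for the inconsistency correction; the threshold and the fall-back-to-diffusion blending are assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import Raft_Large_Weights, raft_large

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()  # rescales frame pairs to [-1, 1] for RAFT


def backward_warp(image, flow):
    """Sample `image` (B, C, H, W) at locations displaced by `flow` (B, 2, H, W)."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device),
        torch.arange(w, device=image.device),
        indexing="ij",
    )
    coords = torch.stack((xs, ys)).float() + flow  # absolute pixel coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)


@torch.no_grad()
def propagate_color(curr_rgb, prev_rgb, prev_ab, diff_ab, thresh=1.0):
    """Warp previous-frame ab channels into the current frame; where the
    forward-backward check fails, fall back to the diffusion output `diff_ab`.
    Frames are (B, 3, H, W) with H, W divisible by 8, as RAFT requires.
    """
    c, p = preprocess(curr_rgb, prev_rgb)
    flow_cp = raft(c, p)[-1]  # final flow iterate: current -> previous
    flow_pc = raft(p, c)[-1]  # final flow iterate: previous -> current
    warped_ab = backward_warp(prev_ab, flow_cp)
    # Cycle consistency: a reliable pixel maps back to (near) itself.
    cycle = flow_cp + backward_warp(flow_pc, flow_cp)
    ok = (cycle.norm(dim=1, keepdim=True) < thresh).float()
    return ok * warped_ab + (1.0 - ok) * diff_ab
```

The forward-backward check flags occlusions and misalignments, the two main sources of the flicker that warping alone would otherwise propagate; keeping the diffusion colors at those pixels is one simple way to realize the correction step described above.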