VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for grayscale video colorization suffer from inaccurate color prediction, poor controllability, and temporal incoherence, especially on complex or information-dense monochrome videos. This work addresses key challenges, including chromatic overflow, temporal inconsistency, and weak semantic control, by proposing an end-to-end multimodal diffusion framework that enables both global and local color controllability. The approach introduces a novel Dual Qformer for cross-modal alignment, integrates depth-guided generation with optical flow regularization, and proposes a luma-channel replacement strategy coupled with conditional chroma injection to jointly enhance fidelity, stability, and controllability. Technically, the framework unifies diffusion-based generation with Qformer feature fusion, depth-map guidance, an optical flow loss, luma preservation, and conditional color injection. Quantitative evaluation demonstrates significant improvements across benchmarks: a 12.3% reduction in LPIPS (higher fidelity) and a 41% reduction in the flicker metric (better temporal consistency). User studies confirm state-of-the-art visual quality and controllability.
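The luma-channel replacement named above is simple to illustrate. Below is a minimal post-processing sketch, assuming the strategy amounts to keeping the grayscale input as the luma channel and taking only the chroma from the generated frame; the exact color space and pipeline stage are not specified here, so the YCrCb conversion via OpenCV is an assumption.

```python
import cv2
import numpy as np

def replace_luma(colorized_frame: np.ndarray, gray_frame: np.ndarray) -> np.ndarray:
    """Keep the chroma of the generated frame but restore the original
    grayscale luma, preserving the structural detail of the source.

    colorized_frame: BGR uint8 frame produced by the generative model.
    gray_frame: single-channel uint8 grayscale source frame (same H, W).
    """
    # Convert the generated frame to a luma/chroma space (YCrCb here).
    ycrcb = cv2.cvtColor(colorized_frame, cv2.COLOR_BGR2YCrCb)
    # Overwrite the Y (luma) channel with the original grayscale values.
    ycrcb[:, :, 0] = gray_frame
    # Convert back to BGR for display or saving.
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```

Since luma carries most of the structural detail, pinning it to the source keeps edges sharp and removes one source of frame-to-frame brightness flicker.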

📝 Abstract
Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations and user studies demonstrate that VanGogh achieves superior temporal consistency and color fidelity.
Project page: https://becauseimbatman0.github.io/VanGogh
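The optical flow loss mentioned in the abstract can be made concrete with a generic warping-based temporal consistency term. The sketch below is not the paper's exact formulation: the backward flow field and occlusion mask are assumed to come from an off-the-shelf estimator such as RAFT, and the loss simply penalizes colorized pixels in the current frame that the flow cannot explain from the previous frame.

```python
import torch
import torch.nn.functional as F

def flow_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `frame` (N, C, H, W, float) by a backward flow (N, 2, H, W),
    where flow[:, 0] is the x displacement and flow[:, 1] is the y displacement."""
    n, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device),
        torch.arange(w, device=frame.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float()          # (2, H, W)
    coords = grid.unsqueeze(0) + flow                     # displaced coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(frame, grid_norm, align_corners=True)

def temporal_flow_loss(curr: torch.Tensor, prev: torch.Tensor,
                       flow: torch.Tensor, occlusion_mask: torch.Tensor) -> torch.Tensor:
    """L1 penalty on color changes that the flow cannot explain;
    occlusion_mask (N, 1, H, W) zeroes out occluded or unreliable pixels."""
    warped_prev = flow_warp(prev, flow)
    return (occlusion_mask * (curr - warped_prev).abs()).mean()
```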
Problem

Research questions and friction points this paper is trying to address.

Colorization
Complex Video
Accuracy and Control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Colorization
Video Processing
Deep Learning

Authors

Zixun Fang (USTC)
Zhiheng Liu (University of Hong Kong)
Kai Zhu (USTC)
Yu Liu (Independent Researcher)
Ka Leong Cheng (HKUST)
Wei Zhai (USTC)
Yang Cao (USTC)
Zhengjun Zha (USTC)