🤖 AI Summary
Real-world videos often suffer from complex, heterogeneous degradations, including noise, compression artifacts, and low-light distortions, which pose significant challenges for generalization across diverse and compound degradation types. To address this, we propose the first mixture-of-agents system for video restoration, inspired by human expert collaboration and comprising three synergistic modules: degradation identification, adaptive routing-based restoration, and quality assessment. Our approach introduces a novel multi-agent architecture featuring a learnable routing mechanism that integrates vision-language models with large language models. Furthermore, we construct Res-VQ, a restoration-oriented video quality assessment dataset, and design a dedicated VLM-based quality assessment model tailored to restoration. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art methods in both objective metrics and perceptual quality, particularly under challenging composite degradation scenarios.
📝 Abstract
Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first **M**ixture-**o**f-**A**gents **V**ideo **R**estoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the **Res**tored **V**ideo **Q**uality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.
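The three-agent workflow described above (identify degradations, route to a restoration tool, assess the result, and repeat if needed) can be sketched in pseudocode form. This is a minimal illustrative sketch, not the paper's actual implementation; all names (`restore_video`, `identifier`, `router`, `assessor`, `toolbox`) and the stopping criteria are hypothetical:

```python
# Hypothetical sketch of a MoA-VR-style agent loop. The identifier stands in
# for the VLM-driven degradation identifier, the router for the LLM-powered
# self-adaptive router, and the assessor for the Res-VQ-based VQA model.

def restore_video(video, identifier, router, assessor,
                  toolbox, max_rounds=3, target_score=0.8):
    """Iteratively identify degradations, apply routed restoration tools,
    and stop once the quality assessor is satisfied (illustrative only)."""
    for _ in range(max_rounds):
        degradations = identifier(video)         # degradation identification
        if not degradations:
            break                                # nothing left to restore
        tool_name = router(video, degradations)  # choose a restoration tool
        video = toolbox[tool_name](video)        # apply the chosen restorer
        if assessor(video) >= target_score:      # restoration quality check
            break
    return video
```

With stub agents standing in for the learned models, one round of denoising would look like `restore_video("noisy_clip", identifier, router, assessor, {"denoise": denoiser})`, looping until the assessor's score clears the (assumed) threshold or the round budget is exhausted.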