Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization

📅 2025-10-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current video quality assessment (VQA) models suffer from poor generalization under direct score supervision, limited interpretability, and difficulty adapting to emerging content types, including user-generated content (UGC), short videos, and AI-generated content (AIGC). To address these limitations, we propose Q-Router, a general-purpose VQA framework based on multi-level agent-style routing. It dynamically selects and weights multiple specialized expert models using vision-language models, enabling spatiotemporal artifact localization and real-time inference. Crucially, Q-Router formalizes routing as an interpretable decision process, facilitating content-adaptive quality prediction. Experiments demonstrate that Q-Router achieves state-of-the-art performance across multiple VQA benchmarks, significantly improving cross-dataset generalization. It also excels on the Q-Bench-Video question-answering task and validates its efficacy as a reward function for post-training video generation models.
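The routing step described above can be sketched in code. The following is a minimal, hypothetical illustration of the idea — a router maps video semantics to weights over specialized expert models, and the final quality score is the weighted ensemble. All names (`Expert`, `route_weights`, the expert labels, and the toy keyword rule standing in for the VLM's reasoning) are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Expert:
    """A specialized VQA expert: name plus a scoring function (video -> score)."""
    name: str
    score_fn: Callable[[str], float]


def route_weights(video_semantics: str, experts: List[Expert]) -> Dict[str, float]:
    """Stand-in for the VLM router: map video semantics to expert weights.
    A toy keyword rule replaces the VLM's content-adaptive reasoning here."""
    weights = {e.name: 0.0 for e in experts}
    if "aigc" in video_semantics:
        weights["aigc_expert"] = 1.0
    elif "ugc" in video_semantics:
        weights["ugc_expert"] = 0.7
        weights["generic_expert"] = 0.3
    else:
        weights["generic_expert"] = 1.0
    return weights


def assess(video_path: str, video_semantics: str, experts: List[Expert]) -> float:
    """Weighted ensemble of expert scores under the router's weights."""
    w = route_weights(video_semantics, experts)
    return sum(w[e.name] * e.score_fn(video_path) for e in experts)


# Dummy experts with fixed scores, purely to exercise the routing logic.
experts = [
    Expert("ugc_expert", lambda v: 3.8),
    Expert("aigc_expert", lambda v: 2.1),
    Expert("generic_expert", lambda v: 3.0),
]

score = assess("clip.mp4", "ugc short-form video", experts)
print(round(score, 2))  # → 3.56
```

In the actual system the router is a vision-language model that reasons over the input video before ensembling, and the weights are produced per input rather than by fixed rules; the sketch only shows the select-then-weight structure.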

๐Ÿ“ Abstract
Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC) and short-form videos to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision-language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier performing spatiotemporal artifact localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router accurately localizes spatiotemporal artifacts, indicating its potential as a reward function for post-training video generation models.
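The abstract's multi-tiered routing by computing budget can be pictured as a simple dispatch: cheaper tiers run a lighter routing pass, and only the heaviest tier pays for spatiotemporal artifact localization. The tier names and budget thresholds below are illustrative assumptions, not values from the paper.

```python
def select_tier(compute_budget: float) -> str:
    """Hypothetical tier dispatch keyed on available compute (arbitrary units).
    Thresholds are made up for illustration."""
    if compute_budget < 1.0:
        return "light"     # fast routing over a small expert pool
    if compute_budget < 4.0:
        return "standard"  # full expert ensemble
    return "heavy"         # ensemble + spatiotemporal artifact localization


for budget in (0.5, 2.0, 10.0):
    print(budget, "->", select_tier(budget))
```

The design point this mirrors is that interpretability (artifact localization) is an opt-in cost: scores remain available at every tier, while localization is reserved for the heaviest one.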
Problem

Research questions and friction points this paper is trying to address.

Improving generalization across diverse video content types
Enhancing interpretability through artifact localization mechanisms
Increasing extensibility to novel video quality assessment tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework with multi-tier expert model routing
Vision-language models dynamically route appropriate expert models
Spatiotemporal artifact localization for improved interpretability