AI Summary
Current video quality assessment (VQA) models suffer from poor generalization under direct score supervision, limited interpretability, and difficulty adapting to emerging content types, including user-generated content (UGC), short videos, and AI-generated content (AIGC). To address these limitations, we propose Q-Router, a general-purpose VQA framework based on multi-level agent-style routing. It dynamically selects and weights multiple specialized expert models using vision-language models, enabling spatiotemporal artifact localization and real-time inference. Crucially, Q-Router formalizes routing as an interpretable decision process, facilitating content-adaptive quality prediction. Experiments demonstrate that Q-Router achieves state-of-the-art performance across multiple VQA benchmarks, significantly improving cross-dataset generalization. It also excels on the Q-Bench-Video question-answering task, and its localization ability validates its use as a reward function for post-training video generation models.
Abstract
Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC) and short-form videos to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA built on a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision-language models (VLMs) as real-time routers that dynamically reason about the input video's semantics and then ensemble the most appropriate experts. The routing system is tiered by compute budget, with the heaviest tier performing spatiotemporal artifact localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question-answering benchmark Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.
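To make the routing idea above concrete, the following is a minimal sketch of budget-constrained expert selection and score ensembling. All names (`Expert`, `route`), the uniform ensemble weights, and the cheapest-first selection rule are illustrative assumptions, not the paper's actual implementation; in Q-Router a VLM infers the video's semantics, which here is passed in as a simple content tag.

```python
# Hypothetical sketch of tiered expert routing for VQA.
# Assumptions (not from the paper): uniform ensemble weights,
# cheapest-first selection, and a string tag standing in for
# VLM-inferred video semantics.
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Expert:
    name: str
    content_types: Set[str]          # content the expert specializes in
    cost: float                      # relative compute cost of running it
    score: Callable[[str], float]    # stand-in for a quality model


def route(content_tag: str, experts: List[Expert], budget: float) -> float:
    """Select experts matching the content type within the compute budget,
    then ensemble their quality scores with uniform weights."""
    eligible = [e for e in experts if content_tag in e.content_types]
    chosen, spent = [], 0.0
    # Greedily add the cheapest eligible experts until the budget is spent.
    for e in sorted(eligible, key=lambda e: e.cost):
        if spent + e.cost <= budget:
            chosen.append(e)
            spent += e.cost
    if not chosen:
        # Fall back to the single cheapest expert if nothing matched.
        chosen = [min(experts, key=lambda e: e.cost)]
    return sum(e.score(content_tag) for e in chosen) / len(chosen)


experts = [
    Expert("ugc_fast", {"ugc"}, cost=1.0, score=lambda v: 3.0),
    Expert("ugc_heavy", {"ugc"}, cost=3.0, score=lambda v: 4.0),
    Expert("aigc_model", {"aigc"}, cost=1.0, score=lambda v: 2.0),
]
print(route("ugc", experts, budget=5.0))   # both UGC experts fit: (3.0 + 4.0) / 2
print(route("aigc", experts, budget=1.0))  # only the AIGC expert runs
```

A larger budget thus corresponds to a heavier tier that can afford more (or costlier) experts, which is how the multi-tier design trades compute for robustness.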