AI Summary
To address the lack of effective methods for assessing the quality and temporal coherence of AI-generated videos, this paper proposes a three-level semantic-aware evaluation framework operating at the frame, segment, and video levels. It integrates CLIP's text encoder for semantic supervision with cross-attention mechanisms, enabling fine-grained semantic alignment between text prompts and generated content as well as modeling of subtle inter-frame variations. The authors introduce two novel modules, a Prompt Semantic Supervision Module and a Semantic Mutation-aware Module, establishing the first multi-granularity semantic quality assessment paradigm designed specifically for AI-generated videos. Hierarchical feature aggregation and a dedicated semantic alignment loss further strengthen discrimination of semantic consistency and temporal coherence. Extensive experiments on major AI video generation benchmarks demonstrate state-of-the-art performance, outperforming conventional user-generated content (UGC)-oriented, full-reference, and no-reference video quality assessment methods.
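The core mechanism described above, frame features attending over prompt-token embeddings via cross-attention, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the projection matrices are random stand-ins for learned weights, and the feature dimensions (512-d features, 64-d attention heads) are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(frame_feats, text_feats, d_k=64, seed=0):
    """Let per-frame visual features attend to prompt-token embeddings.

    frame_feats: (num_frames, d) visual features, one vector per frame
    text_feats:  (num_tokens, d) text-token embeddings (e.g. from a CLIP-style
                 text encoder; shapes here are illustrative assumptions)
    """
    rng = np.random.default_rng(seed)
    d = frame_feats.shape[1]
    # Hypothetical learned projections, drawn randomly for illustration only.
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q = frame_feats @ W_q                     # queries come from video frames
    K = text_feats @ W_k                      # keys come from prompt tokens
    V = text_feats @ W_v                      # values come from prompt tokens
    attn = softmax(Q @ K.T / np.sqrt(d_k))    # (num_frames, num_tokens)
    return attn @ V                           # text-conditioned frame features

frames = np.random.default_rng(1).standard_normal((8, 512))   # 8 frames
tokens = np.random.default_rng(2).standard_normal((16, 512))  # 16 prompt tokens
out = cross_attention(frames, tokens)
print(out.shape)  # (8, 64)
```

Each output row is a prompt-conditioned summary of one frame, which is the kind of representation a downstream quality head could score for text-video alignment.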
Abstract
The rapid development of diffusion models has recently advanced AI-generated videos substantially in both length and consistency, yet assessing AI-generated videos remains challenging. Previous approaches have largely targeted User-Generated Content (UGC); few address quality assessment for AI-generated video. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses the text encoder of CLIP to ensure semantic consistency between videos and their conditional prompts. Additionally, we propose a Semantic Mutation-aware Module to capture subtle variations between consecutive frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.
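One simple way to make "subtle variations between consecutive frames" concrete is to measure how much each frame's semantic embedding drifts from the previous one. The sketch below is an assumption-laden stand-in for the Semantic Mutation-aware Module, not the paper's method: it scores inter-frame change as one minus the cosine similarity of consecutive frame embeddings, so a smooth clip scores near zero and an abrupt semantic jump scores high.

```python
import numpy as np

def mutation_scores(frame_embs):
    """Score semantic change between consecutive frames.

    frame_embs: (num_frames, d) per-frame embeddings.
    Returns (num_frames - 1,) scores in [0, 2]: 0 means identical direction,
    2 means opposite direction (maximal semantic flip).
    """
    a, b = frame_embs[:-1], frame_embs[1:]
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return 1.0 - cos

# Smooth clip: nearly identical frames -> scores near 0.
smooth = np.ones((4, 8)) + 1e-3 * np.arange(4)[:, None]
# Jumpy clip: frame 2 is flipped, simulating an abrupt semantic break.
jumpy = smooth.copy()
jumpy[2] = -jumpy[2]
print(mutation_scores(smooth).max())  # ~0.0
print(mutation_scores(jumpy).max())   # ~2.0 at the flipped frame
```

A temporal-coherence model would learn such a comparison rather than hard-code cosine distance, but the example shows why per-pair frame comparison exposes anomalies that a single whole-video embedding would average away.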