Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency

๐Ÿ“… 2025-02-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Conventional quality assessment methods struggle to model complex textual semantics and long-term dynamic motion for emerging text-to-video generative models like Sora. Method: We propose a multi-granularity text-temporal fusion and hybrid motion fidelity modeling framework. It introduces the first cross-granularity text-video alignment mechanism, integrates optical flow with temporal features for hybrid motion modeling, and designs a self-supervised motion fidelity assessment module. We further develop CRAVE-DBโ€”the first AIGC video benchmark tailored for complex prompts and rich contentโ€”and propose a human-perception-driven no-reference Video Quality Assessment (VQA) architecture. Contribution/Results: Our method achieves state-of-the-art performance across multiple benchmarks, attaining strong correlation with human judgments (SROCC = 0.92). Both code and dataset are publicly released.
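The reported agreement with human judgments is measured by SROCC (Spearman rank-order correlation coefficient), which compares the rank ordering of predicted quality scores against human mean opinion scores. As a quick illustration of how such a score is computed (toy numbers, not the paper's data):

```python
import numpy as np
from scipy.stats import spearmanr

def srocc(model_scores, human_scores):
    """Spearman rank-order correlation between predicted and human quality scores."""
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Toy example: the two score lists have identical rank order, so SROCC = 1.0
preds = [0.10, 0.40, 0.35, 0.80, 0.95]  # model-predicted quality
mos   = [1.2, 2.9, 2.5, 4.1, 4.8]       # human mean opinion scores
print(srocc(preds, mos))  # → 1.0
```

Because SROCC depends only on ranks, it rewards a model for ordering videos the way humans do, even if the raw score scales differ.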

๐Ÿ“ Abstract
The advent of next-generation video generation models like Sora poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially mitigate the flickering artifacts prevalent in prior models, support longer and more complex text prompts, and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods designed for simple text and basic motion patterns struggle to evaluate these content-rich videos. To this end, we propose CRAVE (Content-Rich AIGC Video Evaluator), specifically for the evaluation of Sora-era AIGC videos. CRAVE proposes a multi-granularity text-temporal fusion that aligns long-form complex textual semantics with video dynamics. Additionally, CRAVE leverages hybrid motion-fidelity modeling to assess temporal artifacts. Furthermore, given the straightforward prompts and content in current AIGC VQA datasets, we introduce CRAVE-DB, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments have shown that the proposed CRAVE achieves excellent results on multiple AIGC VQA benchmarks, demonstrating a high degree of alignment with human perception. All data and code will be publicly available at https://github.com/littlespray/CRAVE.
Problem

Research questions and friction points this paper is trying to address.

Assess next-gen AIGC video quality
Align complex text with video dynamics
Evaluate temporal artifacts in videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-granularity text-temporal fusion
Hybrid motion-fidelity modeling
Content-rich AIGC video benchmark
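The cross-granularity text-video alignment idea can be caricatured as blending a coarse (sentence-level) text-video similarity with fine (word-level) similarities. The sketch below is a hypothetical illustration using random placeholder embeddings and an invented `alignment_score` helper; it is not the paper's actual model:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def alignment_score(video_feat, sent_feat, word_feats, w_coarse=0.5):
    """Hypothetical cross-granularity alignment: blend the sentence-level
    (coarse) similarity with the mean of the word-level (fine) similarities."""
    coarse = cosine(video_feat, sent_feat)
    fine = np.mean([cosine(video_feat, w) for w in word_feats])
    return w_coarse * coarse + (1 - w_coarse) * fine

rng = np.random.default_rng(0)
v = rng.normal(size=512)         # pooled video embedding (placeholder)
s = rng.normal(size=512)         # sentence-level prompt embedding (placeholder)
ws = rng.normal(size=(8, 512))   # per-word prompt embeddings (placeholder)
score = alignment_score(v, s, ws)
print(score)  # a blended similarity in [-1, 1]
```

The point of the two granularities is that a long, elaborate prompt can match a video globally while individual entities or actions go missing; averaging word-level similarities penalizes such partial matches.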
๐Ÿ”Ž Similar Papers
No similar papers found.
S
Shangkun Sun
SECE, Peking University; PengCheng Laboratory
Xiaoyu Liang
Tsinghua University
Bowen Qu
Peking University
Wei Gao
SECE, Peking University; PengCheng Laboratory