🤖 AI Summary
This work addresses a longstanding limitation in video quality assessment, where fluency has been treated merely as a subsidiary dimension of overall quality, hindering accurate modeling of human perception of motion consistency and frame continuity. To this end, we formally establish Video Fluency Assessment (VFA) as an independent perceptual task, introduce FluVid, a dedicated dataset of 4,606 real-world videos, and propose the first standardized scoring criteria and subjective study methodology tailored to VFA. We further construct a large-scale benchmark spanning 23 existing methods and present FluNet, a baseline model featuring a temporal permuted self-attention (T-PSA) mechanism that enhances long-range inter-frame interactions. FluNet achieves state-of-the-art performance on FluVid, demonstrating the necessity and efficacy of explicitly modeling fluency as a standalone attribute and providing a systematic framework for future research in this direction.
📝 Abstract
Accurately estimating human subjective judgments of video fluency, e.g., motion consistency and frame continuity, is crucial for applications such as streaming and gaming. Yet fluency has long been overlooked: prior art addresses it only within the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with a balanced fluency distribution, featuring the first scoring criteria and human study for VFA; 2) we develop a benchmark of 23 methods, the most comprehensive on FluVid to date, gathering insights for VFA-tailored model designs; 3) we propose a baseline model, FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap for exploring VFA solutions.
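The abstract does not detail how T-PSA works internally, but one plausible reading is that frame features are reordered along the temporal axis before local (windowed) self-attention, so that temporally distant frames can interact directly. The following is a minimal NumPy sketch under that assumption; the function names, window size, and permutation scheme are illustrative guesses, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def windowed_attention(x, w):
    # Local single-head self-attention within non-overlapping temporal windows.
    T, d = x.shape
    out = np.empty_like(x)
    for s in range(0, T, w):
        win = x[s:s + w]                       # (w, d) window of frame features
        scores = win @ win.T / np.sqrt(d)      # scaled dot-product scores
        out[s:s + w] = softmax(scores) @ win
    return out

def temporal_permuted_self_attention(x, w, perm):
    # Hypothetical T-PSA sketch: reorder frames so distant frames share a
    # window, attend locally, then restore the original frame order.
    inv = np.argsort(perm)
    return windowed_attention(x[perm], w)[inv]

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 32))             # 16 frames, 32-dim features
perm = rng.permutation(16)                     # one possible temporal permutation
out = temporal_permuted_self_attention(frames, w=4, perm=perm)
print(out.shape)                               # (16, 32)
```

Note that plain global self-attention is permutation-equivariant, so a permutation is only meaningful when combined with a locality constraint (here, windows) or positional information; that is why this sketch uses windowed attention.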