🤖 AI Summary
To address the trade-off between efficiency and modeling capability in long-duration, high-resolution video quality assessment (VQA), this paper proposes a lightweight and efficient VQA method based on the Mamba state-space model. The approach introduces two key innovations: (1) a novel Unified Semantic-Distortion Sampling (USDS) mechanism that jointly leverages low-resolution semantic patches and full-resolution distortion patches; and (2) a mask-driven dual-stream fusion scheme that enables complementary feature integration without additional computational overhead. By combining multi-scale video patch sampling with lightweight sequence modeling, the method preserves long-range dependency capture while substantially reducing resource consumption. Experiments show performance comparable to state-of-the-art methods on standard benchmarks, with a 2× speedup in inference and GPU memory usage cut to roughly 20% of baseline models, a significant step toward real-time VQA for long videos.
📝 Abstract
The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, lightweight Convolutional Neural Networks (CNNs) and Transformers often struggle to balance efficiency with high performance because of the long-range modeling the task requires. Recently, the state-space model, particularly Mamba, has emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA heavily depends on resampling long sequences to minimize computational costs, yet current resampling methods are often weak at preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA, together with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos with distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To avoid the computational increase of dual inputs, we propose a fusion mechanism based on pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieves comparable performance to state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$ of the GPU memory.
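To make the sampling idea concrete, here is a minimal NumPy sketch of how two streams could be fused into a single fixed-size input via a pre-defined mask. This is an illustrative reconstruction, not the paper's implementation: the function name `unified_sampling`, the checkerboard mask, the nearest-neighbor downscale, and all parameter choices are assumptions made for the example.

```python
import numpy as np

def unified_sampling(frame, patch=32, grid=7, seed=0):
    """Illustrative sketch of unified semantic/distortion sampling.

    Builds a (grid*patch, grid*patch, 3) input by mixing, per grid cell:
      - a semantic stream: the whole frame downscaled to fit the grid,
        preserving global scene content;
      - a distortion stream: crops taken at the original resolution,
        preserving local distortion detail.
    A pre-defined binary mask decides which cell comes from which stream,
    so the fused input has the same size (and model cost) as one stream.
    """
    H, W, _ = frame.shape
    size = grid * patch

    # Semantic stream: naive nearest-neighbor downscale of the full frame.
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    semantic = frame[ys][:, xs]

    # Distortion stream: one random full-resolution crop per grid cell.
    rng = np.random.default_rng(seed)
    distortion = np.empty_like(semantic)
    for i in range(grid):
        for j in range(grid):
            y = rng.integers(0, H - patch + 1)
            x = rng.integers(0, W - patch + 1)
            distortion[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = \
                frame[y:y+patch, x:x+patch]

    # Pre-defined checkerboard mask: 1 -> semantic cell, 0 -> distortion cell.
    cell_mask = np.indices((grid, grid)).sum(0) % 2
    mask = np.kron(cell_mask, np.ones((patch, patch), dtype=np.int64))[..., None]
    return mask * semantic + (1 - mask) * distortion
```

The key point is the last line: because the mask is fixed in advance, the fused tensor is no larger than a single-stream input, so the downstream sequence model sees no extra tokens.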