🤖 AI Summary
To address three challenges in video quality assessment (VQA), namely the difficulty of modeling spatiotemporal perception, the incompatibility of enhancement strategies with pre-trained backbones, and the absence of adaptive restoration mechanisms, this paper proposes a free-energy-guided dual-branch eye-simulation framework. The framework decouples global aesthetic modeling from local structural and semantic modeling, incorporates a biologically inspired saccade prediction head to emulate dynamic visual attention, and introduces a video self-restoration mechanism grounded in the principle of free-energy minimization. High-order features are injected non-intrusively through collaborative patch-wise and full-frame enhancement, coupled with saliency-guided dynamic feature fusion. Evaluated on five mainstream VQA benchmarks, the method achieves state-of-the-art or highly competitive performance, improving both prediction accuracy and interpretability. The results validate the effectiveness of modeling neuro-perceptual mechanisms for VQA.
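For context, free-energy-guided quality models (as explored in prior IQA work) score a signal by how well an internal generative or restoration process can explain it. The sketch below states that idea with symbols chosen here purely for illustration, not the paper's notation:

$$
F(\mathbf{x}) \;\approx\; -\log p\left(\mathbf{x} \mid \hat{\mathbf{x}}\right), \qquad \hat{\mathbf{x}} = \mathcal{R}(\mathbf{x}),
$$

where $\mathbf{x}$ is a distorted frame, $\mathcal{R}$ is the self-restoration mechanism, and a larger residual between $\mathbf{x}$ and its restoration $\hat{\mathbf{x}}$ indicates higher "surprise" under the internal model and hence lower predicted quality.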
📝 Abstract
Free-energy-guided self-repair mechanisms have shown promising results in image quality assessment (IQA), but remain under-explored in video quality assessment (VQA), where temporal dynamics and model constraints pose unique challenges. Unlike static images, video content exhibits richer spatiotemporal complexity, making perceptual restoration more difficult. Moreover, VQA systems often rely on pre-trained backbones, which limits the direct integration of enhancement modules without affecting model stability. To address these issues, we propose EyeSimVQA, a novel VQA framework that incorporates free-energy-based self-repair. It adopts a dual-branch architecture, with an aesthetic branch for global perceptual evaluation and a technical branch for fine-grained structural and semantic analysis. Each branch integrates specialized enhancement modules tailored to its distinct visual input (resized full-frame images for the aesthetic branch, patch-based fragments for the technical branch) to simulate adaptive repair behaviors. We also explore a principled strategy for incorporating high-level visual features without disrupting the original backbone. In addition, we design a biologically inspired prediction head that models sweeping gaze dynamics to better fuse global and local representations for quality prediction. Experiments on five public VQA benchmarks demonstrate that EyeSimVQA achieves competitive or superior performance compared to state-of-the-art methods, while offering improved interpretability through its biologically grounded design.
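To make the dual-branch design concrete, here is a minimal PyTorch-style sketch. Everything in it (the backbone stubs, the saliency-style gate, the tensor shapes) is an illustrative assumption for exposition, not the authors' implementation:

```python
# Hypothetical sketch of a dual-branch VQA model in the spirit of EyeSimVQA.
# Backbones, module names, and the fusion rule are assumptions, not the paper's code.
import torch
import torch.nn as nn

class DualBranchVQA(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Aesthetic branch: global perceptual features from resized full frames.
        self.aesthetic = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        # Technical branch: fine-grained features from patch-based fragments.
        self.technical = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        # Saliency-style gate: per-video weights for fusing the two branches,
        # standing in for the paper's gaze-sweeping prediction head.
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, 2), nn.Softmax(dim=-1))
        self.head = nn.Linear(feat_dim, 1)  # scalar quality score

    def forward(self, full_frames: torch.Tensor, fragments: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, 3, T, H, W) video tensors.
        g = self.aesthetic(full_frames)        # global representation, (B, feat_dim)
        l = self.technical(fragments)          # local representation, (B, feat_dim)
        w = self.gate(torch.cat([g, l], -1))   # fusion weights, (B, 2)
        fused = w[:, :1] * g + w[:, 1:] * l    # dynamic feature fusion
        return self.head(fused).squeeze(-1)    # (B,) predicted quality

model = DualBranchVQA()
frames = torch.randn(2, 3, 4, 64, 64)   # resized full-frame clips
frags = torch.randn(2, 3, 4, 64, 64)    # spatially sampled fragments
print(model(frames, frags).shape)        # torch.Size([2])
```

The gate here plays the role the abstract assigns to saliency-guided fusion of global and local representations; in the actual model, the gaze-modeling prediction head and per-branch enhancement modules would replace these stubs.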