🤖 AI Summary
To balance model size against accuracy in video saliency prediction, this paper proposes two efficient architectures: ViNet-S (36 MB) and ViNet-A (148 MB). ViNet-S pairs the ViNet architecture with a lightweight U-Net-style convolutional decoder that cuts model size and parameter count without degrading accuracy. ViNet-A replaces the conventional action classification backbone with spatiotemporal action localization (STAL) features, which had not previously been explored in saliency modeling. A training-free ensemble that averages the saliency maps predicted by the two models achieves state-of-the-art performance on three visual-only and six audio-visual saliency benchmarks. ViNet-S runs at over 1000 fps, and the models outperform existing Transformer-based approaches in both parameter efficiency and real-time performance.
📝 Abstract
This paper introduces ViNet-S, a 36 MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148 MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A that averages their predicted saliency maps achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000 fps.
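
The averaging ensemble described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-map min-max normalization step, and the random arrays standing in for ViNet-S and ViNet-A outputs are all assumptions made for illustration. The abstract only states that the predicted saliency maps are averaged.

```python
import numpy as np


def normalize_map(saliency, eps=1e-8):
    """Rescale a saliency map to [0, 1] (an assumed preprocessing step)."""
    s = np.asarray(saliency, dtype=np.float64)
    s = s - s.min()
    return s / (s.max() + eps)


def ensemble_average(maps):
    """Fuse same-sized saliency maps by pixel-wise averaging. No training is involved."""
    stacked = np.stack([normalize_map(m) for m in maps], axis=0)
    return stacked.mean(axis=0)


# Hypothetical stand-ins for the per-frame predictions of the two models.
rng = np.random.default_rng(0)
pred_s = rng.random((224, 384))  # placeholder for a ViNet-S saliency map
pred_a = rng.random((224, 384))  # placeholder for a ViNet-A saliency map

fused = ensemble_average([pred_s, pred_a])
```

Because the fusion is a plain mean over already-computed maps, it adds no learned parameters. It does require running both models, so the ensemble cannot be faster at inference than the slower of the two.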