🤖 AI Summary
Manual screening of underwater monitoring video is slow and inefficient. This paper proposes a vision-based automated framework for underwater visual anomaly detection (VAD) to identify rare or ecologically significant events. Our contributions are threefold: (1) We introduce AURA, the first multi-annotator, scene-diverse underwater VAD benchmark dataset with fine-grained semantic labels; (2) We propose a soft-label and consensus-label fusion strategy and systematically reveal the critical influence of training data scale and normal-sample diversity on model performance; (3) We evaluate four representative VAD model families in two typical marine scenarios using a robust frame-selection strategy, demonstrating substantial performance disparities in real-world underwater environments. This work establishes a paradigm for scalable, high-accuracy intelligent monitoring of marine biodiversity and provides an empirical foundation for advancing underwater anomaly detection.
📝 Abstract
Underwater video monitoring is a promising strategy for assessing marine biodiversity, but the vast volume of uneventful footage makes manual inspection highly impractical. In this work, we explore the use of visual anomaly detection (VAD) based on deep neural networks to automatically identify interesting or anomalous events. We introduce AURA, the first multi-annotator benchmark dataset for underwater VAD, and evaluate four VAD models across two marine scenes. We demonstrate the importance of robust frame-selection strategies for extracting meaningful video segments. Our comparison against multiple annotators reveals that the performance of current VAD models varies dramatically and is highly sensitive to both the amount of training data and the variability in visual content that defines "normal" scenes. Our results highlight the value of soft and consensus labels and offer a practical approach to supporting scientific exploration and scalable biodiversity monitoring.