🤖 AI Summary
This work addresses the long-standing fragmentation among three weakly supervised anomaly detection paradigms (incomplete, inexact, and inaccurate supervision), which have lacked a unified evaluation framework. We introduce WSADBench, the first comprehensive benchmark that systematically evaluates 36 algorithms across four data modalities through over 700,000 standardized experiments, explicitly controlling label quantity, granularity, and quality. Our study reveals strong intrinsic correlations among the three weak supervision settings, challenging the prevailing practice of treating them in isolation. Specialized algorithms outperform general ones only under extreme label scarcity; as supervision increases or in out-of-distribution scenarios, tabular foundation models and general classification methods dominate. Other key findings are that unlabeled data offers inconsistent and marginal gains compared to label refinement, and that models are asymmetrically sensitive to different types of label noise. The benchmark and code are publicly released to advance the field.
📝 Abstract
Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in out-of-distribution (OOD) scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE-AILAB/WSADBench.
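To make the three supervision settings concrete, the sketch below shows one plausible way to derive weak labels from fully labeled binary data by varying label quantity (incomplete), granularity (inexact), and quality (inaccurate). It is an illustration only and is not taken from WSADBench: the function names, the `UNLABELED` marker, the random bagging scheme, and the parameters `labeled_frac`, `bag_size`, `flip_anomaly`, and `flip_normal` are all assumptions, and the benchmark's actual protocols may differ.

```python
import numpy as np

UNLABELED = -1  # hypothetical marker for instances without a label


def incomplete_labels(y, labeled_frac, rng):
    """Label quantity: keep labels for a random fraction of the anomalies
    and mark every other instance as unlabeled."""
    y_weak = np.full_like(y, UNLABELED)
    anomalies = np.flatnonzero(y == 1)
    n_keep = int(round(labeled_frac * anomalies.size))
    keep = rng.choice(anomalies, size=n_keep, replace=False)
    y_weak[keep] = 1
    return y_weak


def inexact_labels(y, bag_size, rng):
    """Label granularity: group instances into random bags of equal size;
    a bag is labeled anomalous if any member is anomalous.
    Leftover instances that do not fill a bag are dropped."""
    order = rng.permutation(y.size)
    n_bags = y.size // bag_size
    bags = order[: n_bags * bag_size].reshape(n_bags, bag_size)
    bag_labels = y[bags].max(axis=1)
    return bags, bag_labels


def inaccurate_labels(y, flip_anomaly, flip_normal, rng):
    """Label quality: flip anomaly->normal and normal->anomaly at
    separate rates, so the two noise types can be studied independently."""
    u = rng.random(y.size)
    flip = np.where(y == 1, u < flip_anomaly, u < flip_normal)
    return np.where(flip, 1 - y, y)
```

Keeping separate flip rates for the two classes is one simple way to probe the asymmetric noise sensitivity mentioned in insight (iv), since missed anomalies and false alarms can then be injected independently.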