🤖 AI Summary
Time-series anomaly detection lacks reliable evaluation metrics: among 37 widely used metrics, none simultaneously satisfies verifiable core properties such as sensitivity, robustness, and monotonicity, which distorts performance assessment and poses risks to safety-critical systems.
Method: This work establishes the first theoretically grounded, verifiability-oriented framework for evaluating time-series anomaly detection, systematically exposing fundamental flaws in existing metrics. Building on this analysis, we propose LARM, the first metric provably satisfying all core properties, and its enhanced variant ALARM, supported by formal modeling, rigorous theoretical proofs, and extensive empirical validation across diverse benchmarks.
Contribution/Results: LARM and ALARM deliver markedly better accuracy, consistency, and cross-method comparability than state-of-the-art metrics, providing both a unified theoretical foundation and a practical, deployable tool for trustworthy anomaly detection evaluation.
📝 Abstract
Undetected anomalies in time series can trigger catastrophic failures in safety-critical systems, such as chemical plant explosions or power grid outages. Although many detection methods have been proposed, their performance remains unclear because current metrics capture only narrow aspects of the task and often yield misleading results. We address this issue by introducing verifiable properties that formalize essential requirements for evaluating time-series anomaly detection. These properties enable a theoretical framework that supports principled evaluations and reliable comparisons. Analyzing 37 widely used metrics, we show that most satisfy only a few properties, and none satisfy all, explaining persistent inconsistencies in prior results. To close this gap, we propose LARM, a flexible metric that provably satisfies all properties, and extend it to ALARM, an advanced variant meeting stricter requirements.
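To make the abstract's claim concrete, here is a minimal sketch (not from the paper) of one well-known way a popular evaluation protocol can mislead: under "point adjustment", flagging a single point of a true anomaly segment counts the entire segment as detected, which can inflate F1 to a perfect score. The function names `f1` and `point_adjust` are hypothetical illustrations, not the paper's API.

```python
# Hypothetical sketch: how the common "point adjustment" protocol can
# inflate F1, illustrating why a metric can yield misleading results.

def f1(labels, preds):
    # Plain point-wise F1 over binary labels/predictions.
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def point_adjust(labels, preds):
    # If any point inside a true anomaly segment is flagged, mark the
    # whole segment as detected (the adjustment used by many TSAD papers).
    adjusted = list(preds)
    i = 0
    while i < len(labels):
        if labels[i] == 1:
            j = i
            while j < len(labels) and labels[j] == 1:
                j += 1
            if any(adjusted[k] == 1 for k in range(i, j)):
                for k in range(i, j):
                    adjusted[k] = 1
            i = j
        else:
            i += 1
    return adjusted

labels = [0, 1, 1, 1, 1, 0, 0, 0]
preds  = [0, 0, 0, 1, 0, 0, 0, 0]   # detects only 1 of 4 anomalous points

print(f1(labels, preds))                        # → 0.4 (raw F1)
print(f1(labels, point_adjust(labels, preds)))  # → 1.0 (inflated to perfect)
```

A detector that catches one point out of four here scores a perfect 1.0 after adjustment, the kind of distortion that verifiable properties such as monotonicity are meant to rule out.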