Benchmarking IoT Time-Series AD with Event-Level Augmentations

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing evaluations of time-series anomaly detection for the Internet of Things (IoT): they predominantly rely on point-level metrics and idealized data, and so fail to capture event-level reliability and early-warning capability in real-world scenarios. To bridge this gap, we propose the first event-level, augmentation-enhanced evaluation framework tailored to IoT, incorporating realistic perturbations such as sensor dropouts, drift, noise, and window shifts. We systematically benchmark 14 state-of-the-art models across seven datasets and introduce a channel-masking probe to support root-cause analysis. Our experiments reveal no universal winner: graph-based models are more robust to long-duration events and dropout perturbations, density-based approaches are vulnerable to monotonic drift, and spectral CNNs excel in strongly periodic settings. Architecture choice is thus a primary determinant of robustness to specific perturbation types.
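The perturbations named above (sensor dropout, drift, noise, window shifts) can be sketched as simple transforms on a multivariate window. This is a minimal illustration, not the paper's actual augmentation code; the function name, severity parameter, and exact drift shapes are assumptions.

```python
import numpy as np

def augment(x, rng, mode, severity=0.1):
    """Apply one realistic perturbation to a (T, C) multivariate window.

    Hypothetical sketch of the paper's augmentation families; the real
    calibration of each perturbation is not specified here.
    """
    x = x.copy()
    T, C = x.shape
    if mode == "dropout":
        # sensor dropout: zero a contiguous span of one random channel
        c = rng.integers(C)
        start = rng.integers(T)
        x[start:start + max(1, int(severity * T)), c] = 0.0
    elif mode == "linear_drift":
        # slowly growing additive bias on one channel
        x[:, rng.integers(C)] += severity * np.linspace(0.0, 1.0, T)
    elif mode == "log_drift":
        # sub-linear (logarithmic) drift, normalized to peak at `severity`
        x[:, rng.integers(C)] += severity * np.log1p(np.arange(T)) / np.log1p(T - 1)
    elif mode == "noise":
        # additive Gaussian noise on all channels
        x += rng.normal(0.0, severity, size=x.shape)
    elif mode == "window_shift":
        # shift the window contents by a few time steps
        x = np.roll(x, max(1, int(severity * T)), axis=0)
    return x
```

Each mode leaves the window shape unchanged, so augmented data can be fed to any detector without retraining-time changes.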

📝 Abstract
Anomaly detection (AD) for safety-critical IoT time series should be judged at the event level: reliability and earliness under realistic perturbations. Yet many studies still emphasize point-level results on curated base datasets, limiting value for model selection in practice. We introduce an evaluation protocol with unified event-level augmentations that simulate real-world issues: calibrated sensor dropout, linear and log drift, additive noise, and window shifts. We also perform sensor-level probing via mask-as-missing zeroing with per-channel influence estimation to support root-cause analysis. We evaluate 14 representative models on five public anomaly datasets (SWaT, WADI, SMD, SKAB, TEP) and two industrial datasets (steam turbine, nuclear turbogenerator) using unified splits and event aggregation. There is no universal winner: graph-structured models transfer best under dropout and long events (e.g., on SWaT under additive noise F1 drops 0.804 → 0.677 for a graph autoencoder, 0.759 → 0.680 for a graph-attention variant, and 0.762 → 0.756 for a hybrid graph attention model); density/flow models work well on clean stationary plants but can be fragile to monotone drift; spectral CNNs lead when periodicity is strong; reconstruction autoencoders become competitive after basic sensor vetting; predictive/hybrid dynamics help when faults break temporal dependencies but remain window-sensitive. The protocol also informs design choices: on SWaT under log drift, replacing normalizing flows with Gaussian density reduces high-stress F1 from ~0.75 to ~0.57, and fixing a learned DAG gives a small clean-set gain (~0.5-1.0 points) but increases drift sensitivity by ~8x.
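The sensor-level probe described in the abstract (mask-as-missing zeroing with per-channel influence) can be sketched as follows. The `score_fn` interface, a window-to-scalar anomaly scorer, is an assumption for illustration; the paper does not specify its probe's exact API.

```python
import numpy as np

def channel_influence(score_fn, x):
    """Per-channel influence via mask-as-missing zeroing.

    score_fn : callable mapping a (T, C) window to a scalar anomaly score
               (hypothetical interface).
    Returns an array where influence[c] is the absolute change in the
    anomaly score when channel c is zeroed out, supporting root-cause
    analysis of which sensor drives a detection.
    """
    base = score_fn(x)
    influence = np.zeros(x.shape[1])
    for c in range(x.shape[1]):
        masked = x.copy()
        masked[:, c] = 0.0  # mask-as-missing: treat the channel as absent
        influence[c] = abs(base - score_fn(masked))
    return influence
```

Channels whose removal changes the score the most are the likeliest root-cause candidates for the flagged event.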
Problem

Research questions and friction points this paper is trying to address.

anomaly detection
IoT time series
event-level evaluation
realistic perturbations
benchmarking
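The event-level evaluation named above differs from point-level scoring: a labeled anomaly event counts as detected if any predicted point overlaps it. A minimal scorer under that assumption (the paper's exact aggregation rule may differ):

```python
def event_f1(labels, preds):
    """Event-level F1 over binary sequences of equal length.

    A true event is detected (TP) if any predicted segment overlaps it;
    a predicted segment overlapping no true event is a false positive.
    Hypothetical aggregation shown for illustration.
    """
    def segments(b):
        # contiguous runs of 1s as half-open (start, end) intervals
        segs, start = [], None
        for i, v in enumerate(b):
            if v and start is None:
                start = i
            elif not v and start is not None:
                segs.append((start, i))
                start = None
        if start is not None:
            segs.append((start, len(b)))
        return segs

    true_events, pred_events = segments(labels), segments(preds)
    overlaps = lambda a, b: a[0] < b[1] and b[0] < a[1]
    tp = sum(any(overlaps(e, p) for p in pred_events) for e in true_events)
    fp = sum(not any(overlaps(p, e) for e in true_events) for p in pred_events)
    fn = len(true_events) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Under this rule a single correct point inside a long event suffices for detection, which rewards early warnings rather than dense point-wise coverage.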
Innovation

Methods, ideas, or system contributions that make the work stand out.

event-level evaluation
time-series anomaly detection
sensor-level probing
realistic augmentations
IoT benchmarking