🤖 AI Summary
To address the challenges of ensuring communication reliability and mitigating delayed fault response in global IoT roaming platforms, this paper proposes ANCHOR, a proactive anomaly detection framework built on passive signaling traffic. ANCHOR combines statistical rules, machine learning, and deep learning models into a unified, unsupervised hybrid detection mechanism that identifies potentially problematic customers, i.e., those whose devices exhibit connection anomalies in batches, across diverse IoT verticals. Through co-optimization of engineering deployment and operational workflows, ANCHOR efficiently filters massive-scale signaling data streams on a production platform and precisely localizes anomalies. This improves the timeliness of fault early warning and, in turn, service availability. To the best of our knowledge, ANCHOR is the first end-to-end, customer-level connection anomaly detection system deployed and operationalized in a large-scale IoT roaming environment.
📝 Abstract
Internet of Things (IoT) application providers rely on Mobile Network Operators (MNOs) and roaming infrastructures to deliver their services globally. In this complex ecosystem, where the end-to-end communication path traverses multiple entities, it has become increasingly challenging to guarantee communication availability and reliability. Further, most platform operators take a reactive approach to communication issues, responding to user complaints only after incidents have become severe, which compromises service quality. This paper presents our experience in the design and deployment of ANCHOR, an unsupervised anomaly detection solution for the IoT connectivity service of a large global roaming platform. ANCHOR assists engineers by filtering vast amounts of data to identify potentially problematic clients (i.e., those with connectivity issues affecting several of their IoT devices), enabling proactive issue resolution before the service is critically impacted. We first describe the IoT service, infrastructure, and network visibility of the IoT connectivity provider we operate. Second, we describe the main challenges and operational requirements for designing an unsupervised anomaly detection solution on this platform. Following these guidelines, we propose a set of statistical rules and machine- and deep-learning models for anomaly detection across IoT verticals, based on passive signaling traffic. We describe the steps we followed with the operational teams to design and evaluate our solution on the operational platform, and report an evaluation on operational IoT customers.