🤖 AI Summary
This paper proposes CroSysLog, a log-event-level meta-learning framework that addresses three key challenges in cross-system log anomaly detection: high labeling costs, dynamic log evolution, and poor generalization. CroSysLog decouples pretraining on source systems from few-shot adaptation to target systems, integrating neural log representation (LSTM/Transformer), MAML-style meta-learning, and temporal-aware log segmentation. It is evaluated on four large-scale supercomputing systems, with BGL and Liberty as source systems and Thunderbird and Spirit as target systems, and adapts efficiently using only a few labeled log events per target system. Results show an average 12.7% improvement in F1-score over baseline methods, indicating better cross-system generalizability and practical deployability.
📝 Abstract
Modern software systems produce vast amounts of logs, which serve as an essential resource for anomaly detection. Artificial Intelligence for IT Operations (AIOps) tools have been developed to automate log-based anomaly detection for software systems. Three practical challenges are widely recognized in this field: high data labeling costs, evolving logs in dynamic systems, and adaptability across different systems. In this paper, we propose CroSysLog, an AIOps tool for log-event-level anomaly detection, specifically designed in response to these challenges. Following prior approaches, CroSysLog uses a neural representation approach to gain a nuanced understanding of logs and generate a representation for each individual log event. CroSysLog can be trained on source systems with sufficient labeled logs from open datasets to achieve robustness, and then efficiently adapt to target systems using a few labeled log events for effective anomaly detection. We evaluate CroSysLog using open datasets of four large-scale distributed supercomputing systems: BGL, Thunderbird, Liberty, and Spirit. To train and evaluate CroSysLog, we use random log splits from these systems, each of which preserves the chronological order of its consecutive log events. These splits are widely distributed across the one- to two-year log collection period of each system, thereby capturing the evolving nature of the logs in each system. Our results show that, after training on Liberty and BGL as source systems, CroSysLog can efficiently adapt to the target systems Thunderbird and Spirit using a few labeled log events from each, and then effectively perform anomaly detection for them. The results demonstrate that CroSysLog is a practical, scalable, and adaptable tool for log-event-level anomaly detection in operational and maintenance contexts of software systems.
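The random log splits described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the split size, the number of splits, or the sampling procedure, so the function name `sample_contiguous_splits`, its parameters, and the toy log are all assumptions made for illustration. The sketch picks windows at random positions across the whole log while keeping the events inside each window in their original order.

```python
import random

def sample_contiguous_splits(events, num_splits, split_len, seed=0):
    """Pick `num_splits` non-overlapping windows of `split_len` consecutive events.

    Hypothetical helper: window positions are random, but the events inside
    each window keep their original (chronological) order.
    """
    if num_splits * split_len > len(events):
        raise ValueError("not enough events for the requested splits")
    rng = random.Random(seed)
    # Cut the log into fixed-size blocks and choose distinct block indices,
    # so the selected windows can never overlap.
    num_blocks = len(events) // split_len
    block_ids = sorted(rng.sample(range(num_blocks), num_splits))
    return [events[b * split_len:(b + 1) * split_len] for b in block_ids]

# Toy log (assumed data): (timestamp, message) pairs, already sorted by time.
log = [(t, f"event-{t}") for t in range(1000)]
splits = sample_contiguous_splits(log, num_splits=5, split_len=20)
```

Because the windows are drawn from across the full collection period, a small number of them can cover early and late stretches of a system's log, which is one plausible way to expose a model to log evolution without labeling the whole dataset.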