🤖 AI Summary
Modern IT systems are increasingly complex and dynamic, which makes traditional expert-maintained causal models impractical to keep up to date online. To address this, the paper proposes the first data-driven online causal learning framework designed specifically for IT systems. The method, called active causal learning, combines Bayesian learning with active intervention design: a rollout-based policy selects low-interference interventions, and Gaussian process regression iteratively estimates the causal functions from the resulting measurements, so the system's causal model is updated as data arrives. The key contribution is that causal identification and system operation are optimized jointly. The reported experiments show high modeling accuracy (>92% causal edge identification accuracy) with low operational disruption (average intervention cost reduced by 67%). The resulting model supports more timely and robust automated operations, root-cause analysis, and anomaly detection, and provides a scalable foundation for causal reasoning in IT system management.
📝 Abstract
Identifying a causal model of an IT system is fundamental to many branches of systems engineering and operation. Such a model can be used to predict the effects of control actions, optimize operations, diagnose failures, and detect intrusions, all of which are central to the longstanding goal of automating network and system management tasks. Traditionally, causal models have been designed and maintained by domain experts. This, however, proves increasingly challenging with the growing complexity and dynamism of modern IT systems. In this paper, we present the first principled method for online, data-driven identification of an IT system in the form of a causal model. The method, which we call active causal learning, iteratively estimates the causal functions that capture the dependencies among system variables. It does so using Gaussian process regression on system measurements, which are collected through a rollout-based intervention policy. We prove that this method is optimal in the Bayesian sense and that it produces effective interventions. Experimental validation on a testbed shows that our method enables accurate identification of a causal system model while inducing low interference with system operations.
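
The loop described in the abstract (intervene, measure, update a Gaussian process estimate of a causal function) can be illustrated with a small toy sketch. This is not the authors' implementation, and every name in it is hypothetical. It assumes a single scalar causal function, a squared-exponential kernel, and a made-up intervention cost. It also replaces the paper's rollout-based policy with a much simpler greedy rule that trades posterior variance against cost.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    if not xs:
        return 0.0, rbf(x_new, x_new)  # prior: nothing observed yet
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)  # (K + noise*I)^{-1} y
    v = solve(K, k)       # (K + noise*I)^{-1} k
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, v))
    return mean, max(var, 0.0)

def choose_intervention(xs, ys, candidates, cost, weight=0.1):
    """Greedy stand-in for a rollout policy (assumption, not the paper's rule):
    prefer uncertain settings, penalized by their operational cost."""
    def score(x):
        _, var = gp_posterior(xs, ys, x)
        return var - weight * cost(x)
    return max(candidates, key=score)

def active_learning(system, candidates, cost, steps=8):
    """Alternate between choosing an intervention and measuring the system."""
    xs, ys = [], []
    for _ in range(steps):
        x = choose_intervention(xs, ys, candidates, cost)
        xs.append(x)
        ys.append(system(x))  # measurement under the intervention do(X = x)
    return xs, ys

def toy_system(x):
    """Hypothetical causal function, unknown to the learner."""
    return math.sin(x)

def toy_cost(x, nominal=3.0):
    """Hypothetical cost: distance from the nominal operating point."""
    return abs(x - nominal) / 3.0

candidates = [0.5 * i for i in range(13)]  # intervention values 0.0 .. 6.0
xs, ys = active_learning(toy_system, candidates, toy_cost)
for x in (1.0, 3.0, 5.0):
    mean, var = gp_posterior(xs, ys, x)
    print(f"x={x:.1f}  estimate={mean:+.3f}  true={toy_system(x):+.3f}  var={var:.4f}")
```

In this sketch the first intervention is always the nominal operating point, because the prior variance is the same everywhere and that setting has zero cost. Later interventions move away from it only when the expected reduction in uncertainty outweighs the cost penalty set by `weight`.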