🤖 AI Summary
Automated discovery on highly variable event logs yields overly complex, hard-to-interpret process models, and existing trace clustering methods largely ignore the probabilities of activities and transitions, so they fail to capture real execution dynamics. This paper proposes a model-driven stochastic trace clustering method: it optimizes a stochastic process model within each cluster and guides trace assignment with entropic relevance, a stochastic conformance metric based on directly-follows probabilities. Assignment therefore accounts for both structural alignment with a cluster's model and the likelihood that the trace was generated by that model. The method is computationally efficient and scales linearly with input size. To our knowledge, this is the first approach to unify stochastic modeling with model-driven optimization in trace clustering, and it produces clusters with clearer control-flow patterns. Extensive evaluation on public real-life datasets shows that it represents process behavior better than existing alternatives and reveals that clustering performance rankings can shift once stochasticity is considered.
📝 Abstract
Process discovery algorithms automatically extract process models from event logs, but high variability often results in complex and hard-to-understand models. To mitigate this issue, trace clustering techniques group process executions into clusters, each represented by a simpler and more understandable process model. Model-driven trace clustering improves on this by assigning traces to clusters based on their conformity to cluster-specific process models. However, most existing clustering techniques either perform no process model discovery or rely on non-stochastic models. They thus neglect the frequency or probability of activities and transitions, which limits their ability to capture real-world execution dynamics. We propose a novel model-driven trace clustering method that optimizes stochastic process models within each cluster. Our approach uses entropic relevance, a stochastic conformance metric based on directly-follows probabilities, to guide trace assignment. This allows clustering decisions to consider both structural alignment with a cluster's process model and the likelihood that a trace originates from a given stochastic process model. The method is computationally efficient, scales linearly with input size, and improves model interpretability by producing clusters with clearer control-flow patterns. Extensive experiments on public real-life datasets show that our method outperforms existing alternatives in representing process behavior and reveal how clustering performance rankings can shift when stochasticity is considered.
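To make the assignment idea concrete, the following is a minimal Python sketch of likelihood-guided, model-driven trace clustering. It is not the authors' implementation. Each cluster is modeled as a table of directly-follows counts, and each trace is reassigned to the cluster whose model encodes it in the fewest bits. The code length used here (the negative log2-likelihood under additively smoothed transition probabilities) is a simplified stand-in for entropic relevance, not the metric's exact definition. The function names `fit_dfg`, `trace_bits`, and `cluster_traces`, the `START`/`END` markers, the smoothing constant `alpha`, and the round-robin initialization are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Artificial boundary activities (an assumption of this sketch).
START, END = "<start>", "<end>"


def fit_dfg(traces):
    """Count directly-follows transitions over a list of traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        path = [START, *trace, END]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return counts


def trace_bits(trace, counts, n_symbols, alpha=0.1):
    """Bits needed to encode a trace under a smoothed directly-follows model.

    Simplified stand-in for entropic relevance: unseen transitions get a
    small smoothed probability instead of a separate fallback cost.
    """
    path = [START, *trace, END]
    bits = 0.0
    for a, b in zip(path, path[1:]):
        row = counts.get(a, Counter())
        total = sum(row.values())
        p = (row[b] + alpha) / (total + alpha * n_symbols)
        bits -= math.log2(p)
    return bits


def cluster_traces(traces, k, max_iter=20):
    """Alternate between fitting one model per cluster and reassigning traces."""
    alphabet = {activity for trace in traces for activity in trace}
    n_symbols = len(alphabet) + 1  # every activity plus END can follow a symbol
    labels = [i % k for i in range(len(traces))]  # deterministic round-robin start
    for _ in range(max_iter):
        models = [
            fit_dfg([t for t, lab in zip(traces, labels) if lab == c])
            for c in range(k)
        ]
        new_labels = [
            min(range(k), key=lambda c: trace_bits(t, models[c], n_symbols))
            for t in traces
        ]
        if new_labels == labels:  # converged: no trace changed cluster
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Toy log with two behaviors that the initial round-robin split mixes.
    log = [list("abc")] * 3 + [list("xyz")] * 3
    print(cluster_traces(log, k=2))
```

In this sketch, refitting visits every event once and reassignment visits every event once per cluster, so each iteration is linear in the log size for a fixed number of clusters. That is in the spirit of the linear scaling the abstract claims, though the paper's actual algorithm and cost function may differ.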