🤖 AI Summary
In process mining, existing neural-network-based approaches to modeling the distributional similarity of activities, such as word2vec and autoencoders, suffer from high computational overhead and poor interpretability. To address these limitations, we propose a lightweight, interpretable, count-based embedding method: it constructs an activity co-occurrence matrix directly from event logs, then applies dimensionality reduction and similarity metrics to capture semantic relations between activities, with no end-to-end training. We design a comprehensive evaluation benchmark that jointly assesses representation quality, downstream task performance (e.g., process discovery and anomaly detection), and computational efficiency. Experiments on multiple real-world datasets demonstrate that our method matches or surpasses state-of-the-art neural approaches in accuracy while reducing training time by one to two orders of magnitude. Because each embedding dimension is derived directly from co-occurrence counts, the representations are also easier to interpret and to deploy in practice.
📝 Abstract
To obtain insights from event data, advanced process mining methods assess the similarity of activities to incorporate their semantic relations into the analysis. Here, distributional similarity, which captures similarity from activity co-occurrences, is commonly employed. However, existing work on distributional similarity in process mining adopts neural network-based approaches developed for natural language processing, e.g., word2vec and autoencoders. While these approaches have been shown to be effective, their downsides are high computational costs and limited interpretability of the learned representations.
In this work, we argue for simplicity in the modeling of distributional similarity of activities. We introduce count-based embeddings that avoid a complex training process and offer a direct interpretable representation. To underpin our call for simple embeddings, we contribute a comprehensive benchmarking framework, which includes means to assess the intrinsic quality of embeddings, their performance in downstream applications, and their computational efficiency. In experiments that compare against the state of the art, we demonstrate that count-based embeddings provide a highly effective and efficient basis for distributional similarity between activities in event data.
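To make the idea of count-based embeddings concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an event log is given as a list of traces (sequences of activity names), that an activity's context consists of the activities within a fixed window around it, and that similarity is measured with cosine similarity. The toy log, the window size, and the function names `count_embeddings` and `cosine` are hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def count_embeddings(traces, window=1):
    """Map each activity to a vector of context counts.

    Assumption: the context of an event is every other event within
    `window` positions in the same trace. Each vector dimension is the
    number of times one specific activity appeared in that context.
    """
    activities = sorted({a for trace in traces for a in trace})
    counts = {a: Counter() for a in activities}
    for trace in traces:
        for i, act in enumerate(trace):
            lo, hi = max(0, i - window), min(len(trace), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[act][trace[j]] += 1
    # Dense vectors with one dimension per activity, in sorted order.
    vectors = {a: [counts[a][b] for b in activities] for a in activities}
    return activities, vectors


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy event log: three traces of a simple approval process.
traces = [
    ["register", "check", "approve", "notify"],
    ["register", "check", "reject", "notify"],
    ["register", "check", "approve", "notify"],
]
activities, vectors = count_embeddings(traces, window=1)

sim_approve_reject = cosine(vectors["approve"], vectors["reject"])
sim_approve_register = cosine(vectors["approve"], vectors["register"])
print(activities)
print(sim_approve_reject, sim_approve_register)
```

In this toy log, `approve` and `reject` occur in the same contexts (after `check`, before `notify`), so their count vectors point in the same direction and they come out as more similar to each other than `approve` is to `register`. Each dimension can be read off directly as "how often activity X occurred next to this activity", which is the kind of interpretability the abstract refers to. A fuller pipeline might add count reweighting or dimensionality reduction, which this sketch omits.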