An Empirical Comparison of Methods for Quantifying the Similarity of Categorical Datasets

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic, neutral comparisons among similarity measures for categorical datasets. It presents the first comprehensive evaluation of several prominent methods, including edge-count tests, the constrained minimum (CM) distance, graph-based tests, Classifier Two-Sample Tests (C2ST), and the Multi-Sample Mahalanobis Cross-Match (MMCM) test, assessing their ability to detect distributional differences and their computational costs in both two-sample and multi-sample settings. The results show that the Friedman–Rafsky test achieves the best overall trade-off in two-sample tasks, while the MMCM test excels in multi-sample scenarios by offering both high statistical power and low resource consumption. This work provides empirical evidence and practical guidance for selecting appropriate similarity measures when analyzing categorical data.
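The C2ST idea mentioned in the summary can be sketched in a few lines (an illustrative example, not code from the paper): label each point by the sample it came from and measure how well a classifier recovers those labels; held-out accuracy near chance (0.5) suggests the samples are hard to tell apart. Here a leave-one-out 1-nearest-neighbour classifier with Hamming distance stands in for the classifier, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two categorical rows differ."""
    return sum(x != y for x, y in zip(a, b))

def c2st_accuracy(sample_a, sample_b):
    """Leave-one-out 1-NN classifier two-sample test (illustrative):
    label each point by its sample of origin (0 or 1) and report how
    often the nearest other point carries the same label. Accuracy
    near 0.5 means the samples are hard to distinguish; accuracy
    near 1.0 indicates a clear distributional difference."""
    data = [(x, 0) for x in sample_a] + [(x, 1) for x in sample_b]
    correct = 0
    for i, (x, y) in enumerate(data):
        others = data[:i] + data[i + 1:]
        pred = min(others, key=lambda t: hamming(t[0], x))[1]
        correct += (pred == y)
    return correct / len(data)
```

Two well-separated samples yield accuracy 1.0, while two identical samples yield low accuracy, since each point's nearest neighbour can sit in the other sample.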

📝 Abstract
Quantifying the similarity of two or more datasets has widespread applications in statistics and machine learning. Choosing a method is difficult, however, due to the abundance of proposed methods and the lack of neutral comparison studies, especially for categorical data. Here, the most promising methods are compared with respect to their ability to detect certain differences between datasets and their resource consumption. The results show that the edge-count tests perform well when comparing two datasets (i.e., the two-sample case). For certain scenarios, the constrained minimum (CM) distance performs even better. For categorical data consisting of variables with five categories each, the best method depends on the type of difference between the distributions, with either the CM distance or certain graph-based tests performing best, or the classifier-based tests (C2ST). This tendency is even clearer for multiple datasets. Overall, the Friedman–Rafsky test can be recommended for two samples as a compromise between high performance, acceptable resource consumption, and few computational errors. For the multi-sample case, the Multi-Sample Mahalanobis Cross-Match (MMCM) test can be recommended due to its comparably good performance and low resource consumption.
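The Friedman–Rafsky edge-count test recommended above can be illustrated with a short sketch (illustrative only, not the paper's implementation): pool the two samples, build a minimum spanning tree over pairwise distances (here the Hamming distance, a natural choice for categorical rows), and count MST edges that join points from different samples. Unusually few cross-sample edges indicate differing distributions. Function names are assumptions for this example.

```python
def hamming(a, b):
    """Number of positions where two categorical rows differ."""
    return sum(x != y for x, y in zip(a, b))

def mst_edges(points):
    """Prim's algorithm over pairwise Hamming distances; returns the
    (i, j) index pairs forming a minimum spanning tree."""
    # best[j] = (distance, i): cheapest connection of j to the tree
    best = {j: (hamming(points[0], points[j]), 0)
            for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        _, i = best.pop(j)
        edges.append((i, j))
        for k in best:
            d = hamming(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

def friedman_rafsky_cross_count(sample_a, sample_b):
    """Count MST edges joining points from different samples. Under
    the null (same distribution) the samples intermix and this count
    tends to be large; unusually few cross edges suggest the
    distributions differ."""
    pooled = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    return sum(labels[i] != labels[j] for i, j in mst_edges(pooled))
```

For example, two well-separated samples produce a single cross edge: `friedman_rafsky_cross_count([("a", "a")] * 3, [("b", "b")] * 3)` returns 1, since the MST links each cluster internally at cost 0 and crosses between them only once.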
Problem

Research questions and friction points this paper is trying to address.

categorical data
dataset similarity
two-sample test
multi-sample comparison
statistical methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

categorical data
two-sample test
multi-sample comparison
graph-based tests
empirical evaluation
Marieke Stolte
Department of Statistics, TU Dortmund University
Jörg Rahnenführer
Statistics, TU Dortmund University
Andrea Bommert
TU Dortmund University