🤖 AI Summary
Problem: Existing synthetic data generation for clustering benchmarks relies on low-level parameters (e.g., means, covariances), which makes scenarios hard to interpret and laborious to construct. Method: We propose a high-level, scenario-driven approach that enables reproducible, interpretable generation of multivariate Gaussian mixture data directly from natural-language descriptions (e.g., "clusters with very different shapes") or high-level geometric parameters (e.g., degree of overlap, variation in cluster shapes). Contribution/Results: This work maps a high-level scenario description to a concrete geometric realization. We implement and open-source a Python package, *repliclust*, that lets users control key structural properties such as cluster shape and overlap. An accompanying interactive demo illustrates data generation from verbal inputs. The approach aims to make benchmark design more convenient, interpretable, and reproducible.
📝 Abstract
Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. Simulation studies on synthetic data are popular because important features of the data sets, such as the overlap between clusters or the variation in cluster shapes, can be effectively varied. Unfortunately, creating evaluation scenarios is often laborious, as practitioners must translate higher-level scenario descriptions like "clusters with very different shapes" into lower-level geometric parameters such as cluster centers and covariance matrices. To make benchmarks more convenient and informative, we propose synthetic data generation based on direct specification of high-level scenarios, either through verbal descriptions or high-level geometric parameters. Our open-source Python package repliclust implements this workflow, making it easy to set up interpretable and reproducible benchmarks for cluster analysis. A demo of data generation from verbal inputs is available at https://demo.repliclust.org.
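To make the translation step concrete, here is a minimal, dependency-free sketch of how high-level knobs might be turned into low-level cluster parameters in two dimensions. It is not the repliclust API: the function names `make_cluster_params` and `sample_mixture`, the `aspect_ratio` and `spacing` parameters, and the choice to place centers on a circle are all assumptions made for illustration. In particular, `spacing` is only a crude stand-in for a proper overlap measure.

```python
import math
import random


def make_cluster_params(n_clusters, aspect_ratio, spacing, seed=0):
    """Map high-level knobs to low-level 2D cluster parameters (illustrative only).

    aspect_ratio: ratio of the longest to the shortest standard deviation.
    spacing: distance between neighbouring cluster centers; larger values
             give less overlap (a rough proxy, not a formal overlap measure).
    """
    rng = random.Random(seed)
    # Centers sit on a circle whose radius makes adjacent centers `spacing` apart.
    radius = spacing / (2 * math.sin(math.pi / n_clusters)) if n_clusters > 1 else 0.0
    params = []
    for k in range(n_clusters):
        theta = 2 * math.pi * k / n_clusters
        params.append({
            "center": (radius * math.cos(theta), radius * math.sin(theta)),
            # Keep long_sd * short_sd == 1 so every cluster has the same area.
            "long_sd": math.sqrt(aspect_ratio),
            "short_sd": 1 / math.sqrt(aspect_ratio),
            # Random orientation of the cluster's long axis.
            "angle": rng.uniform(0, math.pi),
        })
    return params


def sample_mixture(params, n_per_cluster, seed=0):
    """Draw points from the 2D Gaussian clusters described by `params`."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, p in enumerate(params):
        c, s = math.cos(p["angle"]), math.sin(p["angle"])
        cx, cy = p["center"]
        for _ in range(n_per_cluster):
            # Sample along the principal axes, then rotate and shift.
            u = rng.gauss(0, 1) * p["long_sd"]
            v = rng.gauss(0, 1) * p["short_sd"]
            points.append((cx + c * u - s * v, cy + s * u + c * v))
            labels.append(label)
    return points, labels


params = make_cluster_params(n_clusters=3, aspect_ratio=4.0, spacing=6.0)
points, labels = sample_mixture(params, n_per_cluster=50)
```

A benchmark author would then only vary `aspect_ratio` and `spacing` across scenarios, and a fixed `seed` keeps each scenario reproducible. The centers, axis lengths, and orientations are derived rather than specified by hand.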