🤖 AI Summary
Problem: The absence of controllable, reproducible simulation benchmarks for complex high-dimensional structures (linear and nonlinear dependencies, clustering, and anomalies) hinders rigorous evaluation of machine learning methods. Method: We introduce *cardinalR*, an open-source R package that unifies the generation of diverse high-dimensional structures, including nonlinear manifolds and local anomalies. Its core methodology integrates piecewise polynomial and radial basis function representations to construct flexible nonlinear manifolds, while using Gaussian and t-distribution mixtures to generate clusters and anomalies. Structural properties, including dimensionality, signal-to-noise ratio, and structural strength, are parameterized and tunable. Contribution/Results: *cardinalR* provides a standardized, extensible benchmark framework for evaluating dimensionality reduction methods (e.g., t-SNE, UMAP) and supervised/unsupervised learning algorithms. It supports validating model interpretability and is accompanied by curated benchmark datasets and comprehensive usage examples.
📝 Abstract
Simulated high-dimensional data is useful for testing, validating, and improving algorithms used in dimension reduction and in supervised and unsupervised learning. High-dimensional data is characterized by multiple variables that are dependent or associated in some way, for example through linear or nonlinear relationships, clustering, or anomalies. Here we provide new methods for generating a variety of high-dimensional structures using mathematical functions and statistical distributions, organized into the R package cardinalR. Several example data sets are also provided. These will help researchers better understand how different analytical methods work and how they can be improved, with a special focus on nonlinear dimension reduction methods. This package enriches the existing toolset of benchmark datasets for evaluating algorithms.
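To make the idea of combining "mathematical functions and statistical distributions" concrete, the following is a minimal sketch in plain Python rather than R. It does not use or reproduce the cardinalR API. The function name `simulate_curve_with_clusters`, its parameters, the quadratic curve, and the cluster centres are all assumptions chosen for illustration. It builds one nonlinear structure (a noisy quadratic curve embedded in `p` dimensions) and two spherical Gaussian clusters, with a seed for reproducibility.

```python
import random


def simulate_curve_with_clusters(n_curve=200, n_per_cluster=50, p=6,
                                 noise_sd=0.05, seed=1):
    """Return (points, labels): a noisy curve and two Gaussian clusters in p dims.

    Hypothetical helper for illustration only; not a cardinalR function.
    """
    if p < 2:
        raise ValueError("p must be at least 2")
    rng = random.Random(seed)  # local generator so results are reproducible
    points, labels = [], []

    # Nonlinear structure: a quadratic curve in the first two coordinates,
    # zero in the rest, with Gaussian noise added to every coordinate.
    for _ in range(n_curve):
        t = rng.uniform(-1.0, 1.0)
        signal = [t, t * t] + [0.0] * (p - 2)
        points.append([s + rng.gauss(0.0, noise_sd) for s in signal])
        labels.append("curve")

    # Cluster structure: two spherical Gaussian clusters with distinct centres
    # (centres and spread are arbitrary illustrative values).
    centres = {"cluster_a": [3.0] * p, "cluster_b": [-3.0] * p}
    for name, centre in centres.items():
        for _ in range(n_per_cluster):
            points.append([c + rng.gauss(0.0, 0.3) for c in centre])
            labels.append(name)

    return points, labels


points, labels = simulate_curve_with_clusters()
print(len(points), len(points[0]))
```

Because the structure labels are returned alongside the points, a dimension reduction method applied to `points` can be checked against known ground truth, which is the kind of evaluation the abstract describes.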