🤖 AI Summary
This work addresses the limitations of existing meta-learning approaches for pipeline performance estimation (PPE) and dataset performance-based similarity estimation (DPSE), which predominantly rely on dataset meta-features while overlooking rich historical experiments and pipeline metadata, and therefore fail to model interactions between datasets and pipelines. To overcome this, the study introduces knowledge graph embeddings into meta-learning for the first time, constructing a unified knowledge graph that integrates datasets, pipelines, and large-scale experimental results from 144,177 OpenML experiments. By jointly leveraging meta-features and empirical performance records, the proposed method, KGmetaSP, explicitly captures the complex interactions between datasets and pipelines. KGmetaSP achieves substantial improvements in both PPE accuracy and DPSE retrieval effectiveness using a single, general-purpose meta-model, establishing a novel paradigm for cross-dataset meta-learning.
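To make the "single, general-purpose meta-model" idea concrete, the sketch below shows one plausible reading: a single regressor scores any (dataset, pipeline) pair from the concatenation of their embeddings, rather than training one model per pipeline. All data, dimensions, and the linear model here are illustrative assumptions, not the paper's actual architecture or released artifacts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4-dim dataset embeddings and 4-dim pipeline embeddings,
# as if derived from a knowledge graph (synthetic stand-ins, not real KG output).
n_experiments = 200
d_emb = rng.normal(size=(n_experiments, 4))  # dataset embedding per experiment
p_emb = rng.normal(size=(n_experiments, 4))  # pipeline embedding per experiment

# A pipeline-agnostic meta-model: one model over concatenated embeddings.
X = np.hstack([d_emb, p_emb])
true_w = rng.normal(size=8)
# Synthetic "pipeline performance" labels with small noise.
y = X @ true_w + 0.01 * rng.normal(size=n_experiments)

# Fit a linear meta-model by least squares (a deliberately minimal choice).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Score a previously unseen dataset-pipeline pair with the same single model.
new_pair = np.hstack([rng.normal(size=4), rng.normal(size=4)])
predicted_performance = new_pair @ w
```

The key property is that datasets and pipelines share one input space, so adding a new pipeline requires only its embedding, not a new model.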
📝 Abstract
The vast collection of machine learning records available on the web presents a significant opportunity for meta-learning, where past experiments are leveraged to improve performance. Two crucial meta-learning tasks are pipeline performance estimation (PPE), which predicts pipeline performance on target datasets, and dataset performance-based similarity estimation (DPSE), which identifies datasets with similar performance patterns. Existing approaches primarily rely on dataset meta-features (e.g., number of instances, class entropy) to represent datasets numerically and approximate these meta-learning tasks. However, these approaches often overlook the wealth of past experimental results and pipeline metadata available. This limits their ability to capture dataset-pipeline interactions that reveal performance similarity patterns. In this work, we propose KGmetaSP, a knowledge-graph-embedding approach that leverages existing experiment data to capture these interactions and improve both PPE and DPSE. We represent datasets and pipelines within a unified knowledge graph (KG) and derive embeddings that support pipeline-agnostic meta-models for PPE and distance-based retrieval for DPSE. To validate our approach, we construct a large-scale benchmark comprising 144,177 OpenML experiments, enabling a rich cross-dataset evaluation. KGmetaSP enables accurate PPE using a single pipeline-agnostic meta-model and improves DPSE over baselines. The proposed KGmetaSP, KG, and benchmark are released, establishing a new reference point for meta-learning and demonstrating how consolidating open experiment data into a unified KG advances the field.
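The "distance-based retrieval for DPSE" mentioned above can be sketched as nearest-neighbor search over dataset embeddings. Everything below is a hypothetical illustration: the dataset names, vectors, and the choice of cosine distance are assumptions for the sketch, not the paper's released embeddings or its exact distance function.

```python
import numpy as np

# Hypothetical KG-derived embeddings for a few datasets (illustrative values).
dataset_embeddings = {
    "iris":    np.array([0.9, 0.1, 0.0]),
    "wine":    np.array([0.8, 0.2, 0.1]),
    "mnist":   np.array([0.1, 0.9, 0.3]),
    "covtype": np.array([0.0, 0.8, 0.5]),
}

def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more similar."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(target, embeddings, k=2):
    """DPSE-style retrieval: rank other datasets by distance to the target."""
    dists = {
        name: cosine_distance(embeddings[target], vec)
        for name, vec in embeddings.items()
        if name != target
    }
    return sorted(dists, key=dists.get)[:k]

print(most_similar("iris", dataset_embeddings))  # → ['wine', 'mnist']
```

Because the embeddings are trained on shared experiment records, nearby datasets are those on which pipelines have historically performed similarly, which is exactly what DPSE asks for.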