Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets

📅 2025-10-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing AutoML approaches rely on exhaustive hyperparameter search, incurring prohibitive computational overhead; pre-hoc model prediction promises to avoid this cost but remains underexplored. This paper proposes a pre-selection paradigm that combines traditional machine learning with large language model (LLM) agents, using dataset meta-features and statistical descriptors to identify high-potential candidate models and thereby shrink the search space of AutoML frameworks such as AutoGluon. Its key contribution is the first systematic incorporation of LLM agents into model pre-selection for tabular data, pairing semantic understanding of dataset descriptions with structured meta-feature modeling. Evaluated on a benchmark of 175 classification datasets, the method achieves an average 62% reduction in training time while preserving 98.3% accuracy in identifying the optimal model, demonstrating a strong trade-off between efficiency and performance.

๐Ÿ“ Abstract
The field of AutoML has made remarkable progress in post-hoc model selection, with libraries capable of automatically identifying the best-performing models for a given dataset. Nevertheless, these methods often rely on exhaustive hyperparameter searches, automatically training and testing many types of models on the target dataset. In contrast, pre-hoc prediction emerges as a promising alternative, capable of bypassing exhaustive search through intelligent pre-selection of models. Despite its potential, pre-hoc prediction remains under-explored in the literature. This paper explores the intersection of AutoML and pre-hoc model selection by leveraging traditional models and Large Language Model (LLM) agents to reduce the search space of AutoML libraries. By relying on dataset descriptions and statistical information, we reduce the AutoML search space. Our methodology is applied to the AWS AutoGluon portfolio dataset, a state-of-the-art AutoML benchmark containing 175 tabular classification datasets available on OpenML. The proposed approach offers a shift in AutoML workflows, significantly reducing computational overhead while still selecting the best model for the given dataset.
Problem

Research questions and friction points this paper is trying to address.

Explores pre-hoc model selection in AutoML for tabular datasets
Leverages LLMs and dataset descriptions to reduce search space
Aims to reduce computational overhead while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging LLMs for pre-hoc model selection
Reducing AutoML search space using dataset descriptions
Applying pre-hoc prediction to tabular classification datasets
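The pre-hoc pipeline described above can be sketched as: extract statistical meta-features from a dataset, score candidate model families against them, and hand the top candidates to an AutoML framework as a restricted search space. The sketch below is illustrative only: all function names are hypothetical, and the rule-based scorer stands in for the paper's actual LLM-agent and meta-model components.

```python
# Hypothetical sketch of pre-hoc model pre-selection; the heuristic scores
# stand in for the paper's LLM/meta-model ranking and are not its method.

def extract_meta_features(rows, target_index=-1):
    """Compute simple statistical descriptors for a tabular dataset."""
    n_rows = len(rows)
    n_cols = len(rows[0]) - 1  # feature columns, excluding the target
    labels = [r[target_index] for r in rows]
    classes = set(labels)
    # Class imbalance: share of rows belonging to the majority class.
    majority_ratio = max(labels.count(c) for c in classes) / n_rows
    return {"n_rows": n_rows, "n_cols": n_cols,
            "n_classes": len(classes), "majority_ratio": majority_ratio}

def preselect_models(meta, k=2):
    """Rank candidate model families with toy heuristics; keep the top k."""
    scores = {
        # Gradient boosting tends to dominate mid-sized tabular benchmarks.
        "GBM": 0.8 if meta["n_rows"] >= 100 else 0.4,
        # Linear models are cheap and competitive on wide, small datasets.
        "LinearModel": 0.7 if meta["n_cols"] > meta["n_rows"] else 0.3,
        # k-NN is viable on small, low-dimensional data.
        "KNN": 0.6 if meta["n_rows"] < 1000 and meta["n_cols"] <= 20 else 0.2,
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Tiny synthetic dataset: 4 feature columns plus a binary target column.
data = [[i, i % 3, i * 0.5, i % 2, i % 2] for i in range(200)]
meta = extract_meta_features(data)
candidates = preselect_models(meta)
print(candidates)  # the restricted search space handed to the AutoML library
```

In the paper's setting, the returned candidate list would constrain which model families AutoGluon trains, rather than letting it search its full portfolio.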