🤖 AI Summary
Feature selection remains challenging in data-scarce or privacy-sensitive settings where raw data are inaccessible and only feature names and task descriptions are available.
Method: We propose LLM-Select, a zero-shot feature selection framework that leverages large language models (e.g., GPT-4) without exposing any training samples or raw data. It employs multi-mechanism prompting and numerical importance scoring to rank features based solely on semantic cues.
Contribution/Results: To our knowledge, this is the first empirical demonstration that LLMs can identify highly predictive features—without training data or data exposure—achieving performance on par with or surpassing traditional methods like LASSO. LLM-Select enables feature prioritization before any data collection and generalizes across diverse domains and tasks. Extensive evaluation on real-world datasets confirms its effectiveness and robustness in downstream prediction tasks, establishing a novel paradigm for privacy-preserving and data-efficient feature engineering.
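The numerical importance scoring mechanism can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the prompt wording, the score scale, and the helper names (`build_prompt`, `parse_score`, `select_top_k`) are assumptions, and `ask_llm` stands in for any LLM call (e.g., a GPT-4 API request).

```python
import re


def build_prompt(feature: str, outcome: str) -> str:
    """Zero-shot prompt asking for a numerical importance score.

    Hypothetical wording; the paper's actual prompts may differ.
    """
    return (
        f'On a scale from 0 to 1, how important is the feature "{feature}" '
        f'for predicting "{outcome}"? Respond with a single number only.'
    )


def parse_score(reply: str) -> float:
    """Extract the first numeric value from the model's reply."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"No numeric score found in: {reply!r}")
    return float(match.group())


def select_top_k(features, outcome, ask_llm, k):
    """Rank features by LLM-assigned importance and keep the top k.

    `ask_llm` is any callable mapping a prompt string to a reply string,
    so the LLM backend can be swapped out or mocked.
    """
    scores = {f: parse_score(ask_llm(build_prompt(f, outcome))) for f in features}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Note that no training samples appear anywhere in this loop: selection relies only on feature names and the task description, which is what makes the approach usable before any data collection.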
📝 Abstract
In this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., "blood pressure") in predicting an outcome of interest (e.g., "heart failure"), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could benefit practitioners in domains like healthcare and the social sciences, where collecting high-quality data comes at a high cost.