LLM-based feature generation from text for interpretable machine learning

📅 2024-09-11
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
High-dimensional, opaque text representations (e.g., embeddings, bag-of-words) hinder interpretable rule learning. To address this, we propose a novel framework leveraging Llama 2 via prompt engineering to extract a compact set of 62 low-dimensional, semantically transparent, and human-understandable features—such as “methodological rigor” and “novelty”—directly from scientific literature. This marks the first use of large language models to generate *rule-ready*, inherently interpretable textual features. Our pipeline integrates statistical hypothesis testing, supervised classification (Logistic Regression and Random Forest), and rule extraction algorithms to derive actionable, domain-generalizable decision rules. Evaluated on CORD-19 (binary classification) and M17+ (5-class classification), our approach achieves performance comparable to 768-dimensional SciBERT while substantially enhancing model transparency, interpretability, and practical deployability.
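The feature-extraction step described above can be sketched as a prompt template asking the LLM to score one interpretable feature at a time. This is a minimal illustration: the feature names are taken from the summary, but the prompt wording and the 1–5 scale are assumptions, not the paper's exact prompts.

```python
# Hypothetical sketch of the prompt-engineering step.
# Feature names follow the paper's examples; the prompt text and the
# ordinal 1-5 scale are illustrative assumptions.
FEATURES = ["methodological rigor", "novelty", "grammatical correctness"]

def build_prompt(abstract: str, feature: str) -> str:
    """Ask the LLM to rate one interpretable feature of an abstract."""
    return (
        f"Rate the following scientific abstract for '{feature}' "
        f"on a scale from 1 (very low) to 5 (very high). "
        f"Reply with a single integer.\n\n"
        f"Abstract:\n{abstract}"
    )

# One prompt per feature; the integer replies would form the low-dimensional
# feature vector used downstream for classification and rule learning.
prompts = [build_prompt("We propose ...", f) for f in FEATURES]
```

Repeating this over all 62 features yields a 62-dimensional, directly interpretable representation of each article, in contrast to a 768-dimensional SciBERT embedding.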

📝 Abstract
Existing text representations such as embeddings and bag-of-words are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. We demonstrate this process on two datasets (CORD-19 and M17+) containing several thousand scientific articles from multiple disciplines, with a target variable serving as a proxy for research impact. An evaluation based on testing for statistically significant correlation with research impact showed that the Llama 2-generated features are semantically meaningful. We consequently used these generated features in text classification to predict the binary target variable representing the citation rate for the CORD-19 dataset and the ordinal 5-class target representing an expert-awarded grade in the M17+ dataset. Machine-learning models trained on the LLM-generated features provided predictive performance similar to the state-of-the-art embedding model SciBERT for scientific text. The LLM used only 62 features compared to the 768 features in SciBERT embeddings, and these features were directly interpretable, corresponding to notions such as article methodological rigor, novelty, or grammatical correctness. As the final step, we extract a small number of highly interpretable action rules. Consistently competitive results obtained with the same LLM feature set across both thematically diverse datasets show that this approach generalizes across domains.
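The final rule-extraction step can be illustrated with a single action rule applied to the interpretable features. The rule below is a hypothetical sketch: the feature names echo the abstract, but the thresholds and the rule itself are assumptions for demonstration, not rules reported by the paper.

```python
# Illustrative sketch of applying one extracted action rule to the
# LLM-generated interpretable features. Thresholds are assumed, not
# taken from the paper.
def high_impact_rule(features: dict) -> bool:
    """Example rule: high rigor AND high novelty -> predict high citation rate."""
    return (
        features["methodological rigor"] >= 4
        and features["novelty"] >= 4
    )

# A hypothetical article scored by the LLM on the ordinal 1-5 scale.
article = {"methodological rigor": 5, "novelty": 4, "grammatical correctness": 3}
print(high_impact_rule(article))  # rule fires for this article
```

Because each condition references a human-understandable feature, such rules are directly actionable (e.g., improving methodological rigor), which is the interpretability advantage over embedding-based classifiers.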
Problem

Research questions and friction points this paper is trying to address.

Generating interpretable text features using LLMs
Addressing high dimensionality in text representations
Enabling rule learning from interpretable features
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated interpretable text features
Small feature set for rule learning
Domain-generalizable action rules extraction
Vojtěch Balek
Department of Information and Knowledge Engineering, Prague University of Economics and Business
Lukáš Sýkora
Department of Information and Knowledge Engineering, Prague University of Economics and Business
Vilém Sklenák
Centre for Information and Library Services, Prague University of Economics and Business, nam W Churchilla 4, Prague, 13067, Czech Republic
Tomáš Kliegr
Department of Information and Knowledge Engineering, Prague University of Economics and Business