Utilizing Large Language Models for Machine Learning Explainability

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the capability of large language models (LLMs) to autonomously construct interpretable machine learning (ML) pipelines. On two tasks, driver alertness prediction and yeast multilabel classification, we systematically prompt GPT, Claude, and DeepSeek to generate end-to-end ML workflows covering Random Forest, XGBoost, MLP, and LSTM classifiers, alongside explanation components. We introduce SHAP fidelity and feature sparsity as novel quantitative metrics for evaluating interpretability, the first such application in LLM-driven ML automation. Results demonstrate that LLM-generated models match human-designed baselines in predictive performance (e.g., accuracy, F1-score) while achieving high SHAP fidelity and stable feature sparsity, indicating reliable attribution and concise explanations. These findings establish the feasibility and robustness of LLMs in automating interpretable ML, providing both a new paradigm and an empirical foundation for trustworthy, automated AI systems.

📝 Abstract
This study explores the explainability capabilities of large language models (LLMs) when employed to autonomously generate machine learning (ML) solutions. We examine two classification tasks: (i) a binary classification problem focused on predicting driver alertness states, and (ii) a multilabel classification problem based on the yeast dataset. Three state-of-the-art LLMs (i.e., OpenAI GPT, Anthropic Claude, and DeepSeek) are prompted to design training pipelines for four common classifiers: Random Forest, XGBoost, Multilayer Perceptron, and Long Short-Term Memory networks. The generated models are evaluated in terms of predictive performance (recall, precision, and F1-score) and explainability using SHAP (SHapley Additive exPlanations). Specifically, we measure Average SHAP Fidelity (the mean squared error between SHAP approximations and model outputs) and Average SHAP Sparsity (the number of features deemed influential). The results show that LLMs can produce effective, interpretable pipelines with high fidelity and consistent sparsity, closely matching manually engineered baselines and highlighting their potential as automated tools for interpretable ML pipeline generation.
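
To make the two metrics concrete, here is a minimal sketch of how they could be computed. It assumes the shap library with a scikit-learn Random Forest on synthetic stand-in data; the influence threshold and the positive-class focus are illustrative assumptions, since the paper's exact formulation is not reproduced here.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a binary task such as driver alertness prediction.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# TreeExplainer yields additive attributions for tree ensembles.
explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X_test)
phi = sv[1] if isinstance(sv, list) else sv[..., 1]  # positive-class attributions
base = explainer.expected_value[1]

# Average SHAP Fidelity: MSE between the additive SHAP reconstruction
# (base value plus summed attributions) and the model's predicted probability.
reconstruction = base + phi.sum(axis=1)
fidelity_mse = np.mean((reconstruction - model.predict_proba(X_test)[:, 1]) ** 2)

# Average SHAP Sparsity: mean per-sample count of features whose absolute
# attribution exceeds a small threshold (0.01 here is an assumed cutoff).
sparsity = np.mean((np.abs(phi) > 0.01).sum(axis=1))

print(f"Average SHAP Fidelity (MSE): {fidelity_mse:.6f}")
print(f"Average SHAP Sparsity: {sparsity:.2f} influential features")
```

For tree ensembles the SHAP reconstruction is exact up to numerical error, so the fidelity MSE is near zero; model-agnostic explainers such as KernelExplainer would typically yield larger values.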
Problem

Research questions and friction points this paper is trying to address.

Explores LLMs' ability to generate interpretable machine learning solutions
Evaluates automated ML pipelines on classification tasks using SHAP metrics
Assesses fidelity and sparsity of LLM-generated models versus manual baselines
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs autonomously generate machine learning training pipelines
LLMs design classifiers like Random Forest and XGBoost
LLMs produce interpretable models evaluated with SHAP explainability metrics (see the prompting sketch after this list)
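
The pipeline-generation step can be sketched as a single prompt to one provider. This is only an illustration under assumptions: the OpenAI Python client, the gpt-4o model name, and the prompt wording are stand-ins, not the paper's actual prompts; the study queries GPT, Claude, and DeepSeek and then runs and scores the scripts they return.

```python
# Hypothetical prompting sketch; model name and prompt text are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write a complete, runnable Python script that trains a Random Forest "
    "classifier on a CSV file with a binary 'alert' target column, reports "
    "precision, recall, and F1-score on a held-out test split, and computes "
    "SHAP values for the test set with shap.TreeExplainer."
)

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in; the paper also prompts Claude and DeepSeek
    messages=[{"role": "user", "content": prompt}],
)

generated_script = response.choices[0].message.content
print(generated_script)  # the returned script would then be executed and evaluated
```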
👥 Authors

Alexandros Vassiliades
Centre for Research and Technology Hellas

Nikolaos Polatidis
School of Architecture, Technology and Engineering, University of Brighton, BN2 4GJ, United Kingdom

Stamatios Samaras
Centre for Research and Technology Hellas

Sotiris Diplaris
Centre for Research and Technology Hellas

Ignacio Cabrera Martin
School of Architecture, Technology and Engineering, University of Brighton, BN2 4GJ, United Kingdom

Yannis Manolopoulos
University of Nicosia, 2417 Cyprus

Stefanos Vrochidis
Information Technologies Institute, Centre for Research and Technology Hellas

Ioannis Kompatsiaris
Centre for Research and Technology Hellas