Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases

📅 2025-08-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: AutoML and XAI methods struggle to balance efficiency and interpretability on large-scale multi-table databases (millions of samples, tens of thousands of variables, hundreds of millions of records in secondary tables). Method: We propose an end-to-end, lightweight Bayesian framework that jointly performs automated propositionalization, numerical discretization, and categorical value clustering, enabling automatic feature construction and quantitative variable importance estimation in multi-table settings. Crucially, it unifies variable selection and weight learning within a sparse Bayesian inference framework. Contribution/Results: The system is open source and provides both a Python API and a GUI. Experiments demonstrate sublinear time complexity on datasets with hundreds of millions of records, while achieving high predictive accuracy and strong model interpretability, bridging the gap between scalability and transparency in multi-table AutoML.
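To make the idea of combining variable selection with weight learning concrete, here is a minimal sketch of a weighted naive Bayes score, in which a weight of 0 drops a variable and a weight of 1 keeps it at full strength. It is an illustration only: the function names, the Laplace smoothing, and the hand-set weights are assumptions, and the Bayesian procedure Khiops uses to learn the weights is not reproduced here.

```python
import math
from collections import Counter


def fit(rows, labels, alpha=1.0):
    """Estimate class priors and smoothed P(value | class) for categorical variables."""
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / len(labels) for c in classes}
    n_vars = len(rows[0])
    # Distinct values seen per variable, used for Laplace smoothing.
    values = [sorted({r[j] for r in rows}) for j in range(n_vars)]
    cond = {}
    for j in range(n_vars):
        for c in classes:
            counts = Counter(r[j] for r, y in zip(rows, labels) if y == c)
            denom = class_counts[c] + alpha * len(values[j])
            for v in values[j]:
                cond[(j, v, c)] = (counts[v] + alpha) / denom
    return classes, priors, cond


def predict_proba(x, model, weights):
    """Score = log P(c) + sum_j w_j * log P(x_j | c), then normalise over classes.

    Assumes every value in x was seen during fit (no handling of unseen values).
    """
    classes, priors, cond = model
    scores = {}
    for c in classes:
        s = math.log(priors[c])
        for j, v in enumerate(x):
            s += weights[j] * math.log(cond[(j, v, c)])
        scores[c] = s
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}


# Toy data: variable 0 determines the class, variable 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["pos", "pos", "neg", "neg"]
model = fit(rows, labels)

# Hand-set weights (an assumption): keep variable 0, drop variable 1.
weights = [1.0, 0.0]
p_x = predict_proba(("a", "x"), model, weights)
p_y = predict_proba(("a", "y"), model, weights)
```

With the noise variable weighted at 0, the two predictions coincide, which is the sense in which selection and weighting can be treated as one mechanism.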

📝 Abstract

Khiops is an open-source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest, with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayes predictor incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is suited to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available in many environments, both as a Python library and via a user interface.
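The link between discretisation and variable importance can be illustrated with a small sketch: once a numerical variable is cut into intervals, its importance can be read as how much the intervals reduce uncertainty about the target. The version below is a stand-in, not the Khiops criterion: it uses fixed equal-frequency bins and normalised information gain, whereas Khiops chooses the discretisation with its own Bayesian approach. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def equal_frequency_bins(values, n_bins):
    """Give each value a bin index so that bins hold roughly equal numbers of records.

    Simplification: tied values may be split across neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def importance(values, labels, n_bins=2):
    """Normalised information gain of the binned variable, between 0 and 1."""
    h_target = entropy(labels)
    if h_target == 0:
        return 0.0  # A constant target cannot be explained any further.
    bins = equal_frequency_bins(values, n_bins)
    n = len(labels)
    h_conditional = 0.0
    for b in set(bins):
        subset = [y for bin_id, y in zip(bins, labels) if bin_id == b]
        h_conditional += len(subset) / n * entropy(subset)
    return (h_target - h_conditional) / h_target


# A variable whose low/high values separate the classes perfectly.
informative = importance([1, 2, 3, 4, 10, 11, 12, 13], [0, 0, 0, 0, 1, 1, 1, 1])

# A variable whose bins each contain an even class mix.
uninformative = importance([1, 10, 2, 11, 3, 12, 4, 13], [0, 0, 1, 1, 0, 0, 1, 1])
```

A score near 1 means the intervals almost determine the target, and a score near 0 means they carry no predictive information, which gives a variable ranking that is easy to explain.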

Problem

Research questions and friction points this paper is trying to address.

Automating machine learning for large multi-table databases

Providing explainable AI through variable importance measures

Handling high-dimensional data with efficient Bayesian methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian approach for variable selection and classification

Discretisation and clustering for predictive variable importance

Automatic propositionalisation for multi-table database aggregates
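As an illustration of propositionalisation, the sketch below flattens a one-to-many relation into one row per individual by computing aggregates over the secondary table. The table layout, field names, and aggregate names are hypothetical, and the aggregates are hard-coded here, whereas Khiops constructs them automatically.

```python
from collections import defaultdict


def propositionalise(customers, purchases):
    """Return one flat row per customer, with aggregates over that customer's purchases."""
    by_customer = defaultdict(list)
    for record in purchases:
        by_customer[record["customer_id"]].append(record)

    flat_rows = []
    for customer in customers:
        records = by_customer.get(customer["customer_id"], [])
        amounts = [r["amount"] for r in records]
        flat_rows.append({
            **customer,
            "Count(purchases)": len(records),
            # Mean and max are undefined for customers with no purchases.
            "Mean(purchases.amount)": sum(amounts) / len(amounts) if amounts else None,
            "Max(purchases.amount)": max(amounts) if amounts else None,
            "CountDistinct(purchases.category)": len({r["category"] for r in records}),
        })
    return flat_rows


# Root table: one row per individual.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
]

# Secondary table: zero or more rows per individual.
purchases = [
    {"customer_id": 1, "amount": 20.0, "category": "books"},
    {"customer_id": 1, "amount": 40.0, "category": "music"},
    {"customer_id": 1, "amount": 30.0, "category": "books"},
]

flat = propositionalise(customers, purchases)
```

The resulting flat table can be passed to any single-table learner, and each constructed column keeps a readable name that states which aggregate it represents.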
