Khiops: An End-to-End, Frugal AutoML and XAI Machine Learning Solution for Large, Multi-Table Databases

📅 2025-08-28
🤖 AI Summary
Problem: AutoML and XAI struggle to balance efficiency and interpretability on large multi-table databases (millions of samples, tens of thousands of variables, hundreds of millions of secondary-table records). Method: We propose an end-to-end, lightweight Bayesian framework that jointly performs automated propositionalization, numerical discretization, and categorical value clustering, enabling automatic feature construction and quantitative variable importance estimation in multi-table settings. Crucially, it unifies variable selection and weight learning within a sparse Bayesian inference framework. Contribution/Results: The system is open source and provides both a Python API and a GUI. Experiments demonstrate sublinear time complexity on datasets with hundreds of millions of records, while achieving high predictive accuracy and strong model interpretability, bridging the gap between scalability and transparency in multi-table AutoML.

📝 Abstract
Khiops is an open source machine learning tool designed for mining large multi-table databases. Khiops is based on a unique Bayesian approach that has attracted academic interest with more than 20 publications on topics such as variable selection, classification, decision trees and co-clustering. It provides a predictive measure of variable importance using discretisation models for numerical data and value clustering for categorical data. The proposed classification/regression model is a naive Bayesian classifier incorporating variable selection and weight learning. In the case of multi-table databases, it provides propositionalisation by automatically constructing aggregates. Khiops is adapted to the analysis of large databases with millions of individuals, tens of thousands of variables and hundreds of millions of records in secondary tables. It is available in many environments, both as a Python library and via a user interface.
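As a rough illustration of what a naive Bayes classifier with per-variable weights looks like, the sketch below scores each class as log P(c) + sum_i w_i * log P(x_i | c). A weight of 0 drops a variable entirely (selection), and a fractional weight damps its influence. This is not the Khiops implementation: the weights are set by hand rather than learned, and the function names and toy data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def fit_counts(rows, labels):
    """Count class frequencies and per-variable value frequencies within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (variable index, class) -> Counter of values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(row, class_counts, value_counts, weights, n_values):
    """Return the class maximising log P(c) + sum_i w_i * log P(x_i | c)."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, value in enumerate(row):
            if weights[i] == 0.0:
                continue  # weight 0 means the variable is not selected
            # Laplace smoothing so unseen values never yield log(0)
            p = (value_counts[(i, c)][value] + 1) / (n_c + n_values[i])
            score += weights[i] * math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): variable 0 separates the classes, variable 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["pos", "pos", "neg", "neg"]
class_counts, value_counts = fit_counts(rows, labels)

weights = [1.0, 0.0]   # hand-set for the example: keep variable 0, drop variable 1
n_values = [2, 2]      # number of distinct values per variable, used for smoothing
prediction = predict(("a", "y"), class_counts, value_counts, weights, n_values)
```

Because the noise variable has weight 0, the prediction depends only on variable 0, which is what makes the resulting model easy to read: the surviving variables and their weights are the explanation.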
Problem

Research questions and friction points this paper is trying to address.

Automating machine learning for large multi-table databases
Providing explainable AI through variable importance measures
Handling high-dimensional data with efficient Bayesian methods
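To make the idea of a discretisation-based variable importance measure concrete, here is a minimal sketch. It is a simplified stand-in, not the Bayesian criterion Khiops uses: the cut points are given rather than optimised, and importance is computed as the fraction of class entropy removed by knowing the interval, 1 - H(Y | interval) / H(Y). The function names and data are assumptions for this example.

```python
import math
from bisect import bisect_right
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(n / total * math.log2(n / total) for n in counts if n > 0)


def importance(values, labels, cut_points):
    """Share of class entropy explained by the interval a value falls into.

    cut_points must be sorted; they are assumed given, not learned.
    Returns a number between 0 (uninformative) and 1 (intervals are pure).
    """
    h_y = entropy(Counter(labels).values())
    if h_y == 0:
        return 0.0  # a single class: nothing to explain
    per_interval = {}
    for v, y in zip(values, labels):
        idx = bisect_right(cut_points, v)  # index of the interval containing v
        per_interval.setdefault(idx, Counter())[y] += 1
    n = len(labels)
    h_cond = sum(
        sum(c.values()) / n * entropy(c.values()) for c in per_interval.values()
    )
    return 1.0 - h_cond / h_y


values = [1, 2, 3, 4, 5, 6]
separable = ["a", "a", "a", "b", "b", "b"]
alternating = ["a", "b", "a", "b", "a", "b"]

full = importance(values, separable, [3.5])    # each interval holds one class
none = importance(values, separable, [])       # one interval, no information
weak = importance(values, alternating, [3.5])  # intervals stay mixed
```

A variable whose best discretisation leaves the classes mixed gets a score near 0 and can be discarded, which is how a measure of this kind doubles as a filter for high-dimensional data.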
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian approach for variable selection and classification
Discretisation and clustering for predictive variable importance
Automatic propositionalisation for multi-table database aggregates
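Propositionalisation, the last item above, can be pictured as flattening a secondary table into one row of aggregates per individual. The sketch below hard-codes a few aggregates over a hypothetical customers/orders schema. Khiops constructs and selects such aggregates automatically, and that search is not reproduced here; every table, column, and function name is an assumption for illustration.

```python
from collections import defaultdict
from statistics import mean


def propositionalise(customers, orders):
    """Attach per-customer aggregates of the orders table to each customer row."""
    amounts_by_customer = defaultdict(list)
    for order in orders:
        amounts_by_customer[order["customer_id"]].append(order["amount"])

    flat = []
    for customer in customers:
        amounts = amounts_by_customer.get(customer["id"], [])
        flat.append({
            **customer,
            "order_count": len(amounts),
            "order_amount_sum": sum(amounts),
            # mean and max are undefined for customers without orders
            "order_amount_mean": mean(amounts) if amounts else None,
            "order_amount_max": max(amounts) if amounts else None,
        })
    return flat


# Root table: one row per individual. Secondary table: many rows per individual.
customers = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
]
flat = propositionalise(customers, orders)
```

The result is a single flat table that any single-table learner can consume, with each constructed column traceable to a named aggregate over the secondary table.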
Marc Boullé
Orange Research
Nicolas Voisine
Orange Research
Bruno Guerraz
Orange Research
Carine Hue
Orange Research
Felipe Olmos
Orange Research
Vladimir Popescu
Orange Research
Stéphane Gouache
Orange Research
Stéphane Bouget
Orange Research
Alexis Bondu
Orange Labs
Machine Learning, time series, classification, co-clustering
Luc Aurelien Gauthier
Orange Research
Yassine Nair Benrekia
Orange Research
Fabrice Clérot
Orange Research
Vincent Lemaire
Orange Research