Anti-Money Laundering Machine Learning Pipelines; A Technical Analysis on Identifying High-risk Bank Clients with Supervised Learning

📅 2025-09-10
🤖 AI Summary
This study addresses high-risk customer identification in anti-money laundering (AML) by developing an end-to-end, deployable machine learning pipeline. It introduces a 16-step modeling framework that integrates a lightweight SQLite database, SQL-driven feature engineering, and an explainable AI (XAI) module, with the aim of keeping the model traceable, transparent, and ready for production. Its main contribution is embedding database operations in the ML workflow: features are generated in SQL, the pre-trained model is connected to the database for inference, and XAI modules report feature importance. Evaluated on the dataset for Task 1 of the University of Toronto 2023-2024 IMI Big Data and Artificial Intelligence Competition, the pipeline achieves a mean AUROC of 0.961 (SD = 0.005) and placed second in the competition.
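The headline metric above is a mean AUROC with a standard deviation across repeated evaluations. As a minimal sketch of how such a summary can be computed, the Python below implements AUROC from its pairwise definition and then aggregates per-run values. The function names `auroc` and `summarize`, the toy labels and scores, and the choice of the sample standard deviation are assumptions for illustration; the summary does not state how the runs were generated or which SD estimator was used.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(per_run_aurocs):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(per_run_aurocs), stdev(per_run_aurocs)


# Toy data, not taken from the paper: 1 marks a high-risk client.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)

# Illustrative per-run values only; these are not the paper's results.
toy_mean, toy_sd = summarize([0.95, 0.96, 0.97])
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; on a dataset of this size a rank-based implementation would be the practical choice.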

📝 Abstract
Anti-money laundering (AML) measures are among the priorities of financial institutions, and machine learning (ML) has shown high potential for supporting them. In this paper, we propose a comprehensive and systematic approach for developing ML pipelines to identify high-risk bank clients in a dataset curated for Task 1 of the University of Toronto 2023-2024 Institute for Management and Innovation (IMI) Big Data and Artificial Intelligence Competition. The dataset included 195,789 customer IDs, and we employed a 16-step design and statistical analysis to ensure the final pipeline was robust. We also framed the data in a SQLite database, developed SQL-based feature engineering algorithms, connected our pre-trained model to the database to make it inference-ready, and provided explainable artificial intelligence (XAI) modules to derive feature importance. Our pipeline achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.961 with a standard deviation (SD) of 0.005, and it placed second in the competition.
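To make the idea of SQL-based feature engineering over a SQLite database concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`customers`, `transactions`), the column names, the `'wire'` channel value, and the aggregate features are all hypothetical; the abstract does not describe the competition schema or the features the authors built.

```python
import sqlite3


def build_features(conn):
    """Aggregate one feature row per customer with a single SQL query.

    Customers without transactions are kept via LEFT JOIN and get zeros.
    """
    query = """
        SELECT c.customer_id,
               COUNT(t.amount)                      AS txn_count,
               COALESCE(SUM(t.amount), 0.0)         AS total_amount,
               COALESCE(MAX(t.amount), 0.0)         AS max_amount,
               COALESCE(SUM(t.channel = 'wire'), 0) AS wire_count
        FROM customers AS c
        LEFT JOIN transactions AS t ON t.customer_id = c.customer_id
        GROUP BY c.customer_id
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY);
    CREATE TABLE transactions (customer_id TEXT, amount REAL, channel TEXT);
    INSERT INTO customers VALUES ('C1'), ('C2'), ('C3');
    INSERT INTO transactions VALUES
        ('C1', 100.0, 'wire'),
        ('C1', 250.0, 'card'),
        ('C2', 9000.0, 'wire');
""")
features = build_features(conn)
conn.close()
```

Keeping the aggregation in SQL means the same query can feed both training and inference, which is one plausible reading of what "inference-ready" involves here.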
Problem

Research questions and friction points this paper is trying to address.

Identifying high-risk bank clients using machine learning
Developing robust AML pipelines with supervised learning
Applying explainable AI for feature importance in financial risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised learning for high-risk client identification
SQL-based feature engineering algorithms development
Explainable AI modules for feature importance
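The page does not say which XAI technique the authors used to derive feature importance. As one common, model-agnostic option, the sketch below shows permutation importance: shuffle one feature column at a time and measure how much the AUROC drops. The technique choice, the function names, the toy scoring function, and the toy data are all assumptions for illustration, not the paper's method.

```python
import random


def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, labels, n_features, repeats=20, seed=0):
    """Mean AUROC drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = auroc(labels, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - auroc(labels, [predict(r) for r in shuffled]))
        importances.append(sum(drops) / repeats)
    return importances


# Toy stand-in for a trained model: the score depends only on feature 0.
def toy_predict(row):
    return row[0]


toy_rows = [(0.1, 5.0), (0.2, 1.0), (0.3, 4.0), (0.7, 2.0), (0.8, 3.0), (0.9, 6.0)]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_importances = permutation_importance(toy_predict, toy_rows, toy_labels, n_features=2)
```

Because the toy scorer ignores feature 1, shuffling that column cannot change the AUROC, so its importance is exactly zero, while feature 0 shows a positive drop.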
Khashayar Namdar
Institute of Medical Science, University of Toronto, Toronto, ON, Canada; Vector Institute, Toronto, ON, Canada; NVIDIA Deep Learning Institute, Austin, TX, United States
Pin-Chien Wang
Rotman School of Management, University of Toronto, Toronto, ON, Canada
Tushar Raju
Rotman School of Management, University of Toronto, Toronto, ON, Canada
Steven Zheng
University of Manitoba
Fiona Li
Rotman School of Management, University of Toronto, Toronto, ON, Canada
Safwat Tahmin Khan
Institute of Biomedical Engineering, University of Toronto, Toronto, ON, Canada