Anti-Money Laundering Machine Learning Pipelines; A Technical Analysis on Identifying High-risk Bank Clients with Supervised Learning

📅 2025-09-10

🤖 AI Summary

This study addresses high-risk customer identification in anti-money laundering (AML) by developing an end-to-end, deployable machine learning pipeline. It follows a 16-step design and statistical analysis process and integrates a lightweight SQLite database, SQL-based feature engineering, and explainable AI (XAI) modules for feature importance, with the aim of keeping the model traceable, transparent, and inference-ready. Its main contribution is embedding database operations in the ML workflow: features are generated in SQL, and the pre-trained model is connected directly to the database for inference. Evaluated on the dataset for Task 1 of the University of Toronto 2023-2024 IMI Big Data and Artificial Intelligence Competition, the pipeline achieves a mean AUROC of 0.961 (SD = 0.005) and placed second in the competition.
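As a rough illustration of what SQL-based feature engineering on a SQLite database can look like, the sketch below aggregates per-customer features with a single `GROUP BY` query. The `transactions` table, its columns, and the chosen aggregates are hypothetical and are not taken from the paper; only Python's standard `sqlite3` module is used.

```python
import sqlite3

# Hypothetical schema and toy rows: the paper's actual tables are not described here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id TEXT, amount REAL, is_cash INTEGER);
INSERT INTO transactions VALUES
  ('C1', 100.0, 0), ('C1', 9500.0, 1), ('C1', 9800.0, 1),
  ('C2', 50.0, 0), ('C2', 75.0, 0);
""")

# One row per customer; each aggregate becomes a model feature (assumed features).
FEATURE_SQL = """
SELECT customer_id,
       COUNT(*)     AS n_txn,
       SUM(amount)  AS total_amount,
       MAX(amount)  AS max_amount,
       AVG(is_cash) AS cash_share
FROM transactions
GROUP BY customer_id
ORDER BY customer_id
"""

def build_features(connection):
    """Run the feature query and return a list of {column: value} dicts."""
    cur = connection.execute(FEATURE_SQL)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

features = build_features(conn)
for row in features:
    print(row)
```

Keeping the feature definitions in SQL means the same query can be run at training time and at inference time, which is one plausible reason to place the database inside the workflow.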

📝 Abstract

Anti-money laundering (AML) actions and measures are among the priorities of financial institutions, and machine learning (ML) has shown high potential for them. In this paper, we propose a comprehensive and systematic approach for developing ML pipelines to identify high-risk bank clients in a dataset curated for Task 1 of the University of Toronto 2023-2024 Institute for Management and Innovation (IMI) Big Data and Artificial Intelligence Competition. The dataset included 195,789 customer IDs, and we employed a 16-step design and statistical analysis to ensure the final pipeline was robust. We also stored the data in a SQLite database, developed SQL-based feature engineering algorithms, connected our pre-trained model to the database to make it inference-ready, and provided explainable artificial intelligence (XAI) modules to derive feature importance. Our pipeline achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.961 with a standard deviation (SD) of 0.005 and placed second in the competition.
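To make the reported metric concrete, here is a minimal sketch of how an AUROC can be computed from labels and scores, and how a mean and SD can be summarized over repeated evaluations. It uses the rank-sum (Mann-Whitney) formulation in plain Python. How the paper repeated its evaluations, and whether its SD is the sample or population form, is not stated in the abstract; the sample SD and the example values below are assumptions.

```python
import statistics

def auroc(labels, scores):
    """AUROC via the rank-sum formulation, using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def summarize(aurocs):
    """Mean and sample SD over repeated evaluations (sample SD is an assumption)."""
    return statistics.mean(aurocs), statistics.stdev(aurocs)

# Toy labels (1 = high-risk) and model scores, for illustration only.
single_run = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Made-up per-run AUROC values, not results from the paper.
mean_auc, sd_auc = summarize([0.95, 0.96, 0.97])
```

Reporting the SD alongside the mean shows how much the score moves between evaluation runs, which is what the robustness claim rests on.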

Problem

Research questions and friction points this paper is trying to address.

Identifying high-risk bank clients using machine learning

Developing robust AML pipelines with supervised learning

Applying explainable AI for feature importance in financial risk

Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised learning for high-risk client identification

Development of SQL-based feature engineering algorithms

Explainable AI modules for feature importance
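The summary does not say which XAI technique the paper uses to derive feature importance. As a stand-in, the sketch below shows permutation importance, a common model-agnostic approach: shuffle one feature column at a time and measure how much a performance metric drops. The toy model, the data, and the use of accuracy as the metric are all assumptions made for illustration.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical classifier that looks only at feature 0."""
    return int(row[0] > 0.5)

# Toy data: feature 0 determines the label, feature 1 is noise.
X = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
y = [0, 1, 0, 1, 0, 1]
imp = permutation_importance(toy_model, X, y)
```

In this toy setup, the feature the model ignores gets an importance of exactly zero, while shuffling the feature it relies on lowers accuracy.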

🔎 Similar Papers

Network Analytics for Anti-Money Laundering - A Systematic Literature Review and Experimental Evaluation