🤖 AI Summary
This study addresses the identification of high-risk customers in anti-money laundering (AML) by developing an end-to-end, deployable machine learning pipeline. It follows a 16-step systematic modeling framework that integrates a lightweight SQLite database, SQL-based feature engineering, and an explainable AI (XAI) module for deriving feature importance, with the aim of keeping the model traceable, transparent, and ready for inference. Its distinguishing feature is that database operations are embedded in the ML workflow: features are generated in SQL, and the pre-trained model is connected to the database for inference. Evaluated on the dataset for Task 1 of the University of Toronto 2023-2024 IMI Big Data and Artificial Intelligence Competition, the pipeline achieves a mean AUROC of 0.961 (SD = 0.005) and placed second in the competition.
📝 Abstract
Anti-money laundering (AML) measures are among the priorities of financial institutions, and machine learning (ML) has shown high potential for supporting them. In this paper, we propose a comprehensive and systematic approach for developing ML pipelines to identify high-risk bank clients in a dataset curated for Task 1 of the University of Toronto 2023-2024 Institute for Management and Innovation (IMI) Big Data and Artificial Intelligence Competition. The dataset included 195,789 customer IDs, and we employed a 16-step design and statistical analysis to ensure the final pipeline was robust. We also stored the data in a SQLite database, developed SQL-based feature engineering algorithms, connected our pre-trained model to the database to make it inference-ready, and provided explainable artificial intelligence (XAI) modules to derive feature importance. Our pipeline achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.961 with a standard deviation (SD) of 0.005 and placed second in the competition.
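
To make the idea of SQL-based feature engineering on a SQLite database concrete, the following is a minimal sketch using Python's standard `sqlite3` module. It is not the authors' implementation: the table names (`customers`, `transactions`), the columns (`amount`, `is_cash`), the toy rows, and the three aggregate features are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical schema and toy data; the competition dataset is not reproduced here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY);
CREATE TABLE transactions (
    customer_id TEXT,
    amount      REAL,
    is_cash     INTEGER  -- 1 if the transaction was in cash, else 0
);
""")
conn.executemany("INSERT INTO customers VALUES (?)", [("C1",), ("C2",), ("C3",)])
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("C1", 100.0, 0), ("C1", 300.0, 1), ("C2", 50.0, 0)],
)

# One row per customer. The LEFT JOIN keeps customers with no transactions,
# and COALESCE turns their NULL aggregates into zeros.
FEATURE_SQL = """
SELECT c.customer_id,
       COUNT(t.amount)               AS n_txn,
       COALESCE(SUM(t.amount), 0.0)  AS total_amount,
       COALESCE(AVG(t.is_cash), 0.0) AS cash_share
FROM customers AS c
LEFT JOIN transactions AS t ON t.customer_id = c.customer_id
GROUP BY c.customer_id
ORDER BY c.customer_id
"""


def build_features(connection):
    """Run the feature query and return one dict per customer."""
    cur = connection.execute(FEATURE_SQL)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


features = build_features(conn)
for row in features:
    print(row)
```

Keeping the feature logic in a single SQL query means the same statement can be run when building the training table and again at inference time against fresh rows, which is one plausible way to connect a pre-trained model to the database.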