LeakageDetector: An Open Source Data Leakage Analysis Tool in Machine Learning Pipelines

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Data leakage, typically caused by coding errors that lead to model overfitting and poor generalization, poses a critical risk in machine learning engineering. This paper introduces LeakageDetector, the first IDE-embedded plugin for PyCharm that enables real-time, static detection of data leakage during coding, coupled with context-aware remediation suggestions. Methodologically, it integrates Python Abstract Syntax Tree (AST) parsing, rule-based pattern matching, and semantic modeling tailored to scikit-learn, TensorFlow, and PyTorch ML pipelines, systematically identifying seven canonical leakage patterns. Its key contribution is shifting leakage detection into the development environment, enabling "detect-as-you-code" and "fix-as-you-detect." Empirical evaluation indicates that LeakageDetector reduces the incidence of data leakage by over 90%, substantially enhancing the robustness and reliability of ML engineering practices.
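The plugin's implementation is not reproduced on this page; as an illustrative sketch only, the AST-based, rule-driven detection the summary describes could look like the following. The snippet, function name, and the single rule (a `fit`/`fit_transform` call appearing before `train_test_split`) are hypothetical simplifications, not the paper's actual rules.

```python
import ast

# Hypothetical leaky snippet: the scaler is fit on the full dataset
# before the train/test split, so test-set statistics leak into training.
LEAKY_SNIPPET = """
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
"""

def find_preprocessing_leakage(source):
    """Return line numbers of fit/fit_transform calls made before train_test_split."""
    tree = ast.parse(source)
    split_line = None
    fit_lines = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
        if name == "train_test_split":
            split_line = node.lineno
        elif name in ("fit", "fit_transform"):
            fit_lines.append(node.lineno)
    if split_line is None:
        return []
    return [line for line in fit_lines if line < split_line]

print(find_preprocessing_leakage(LEAKY_SNIPPET))
```

A real rule engine would also need data-flow tracking (which variables feed the split) and library-aware semantics, which is where the summary's "semantic modeling" comes in.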

📝 Abstract
Code quality is of paramount importance in all types of software development settings. Our work seeks to enable Machine Learning (ML) engineers to write better code by helping them find and fix instances of Data Leakage in their models. Data Leakage often results from bad practices in writing ML code. As a result, the model effectively "memorizes" the data on which it trains, leading to an overly optimistic estimate of the model performance and an inability to make generalized predictions. ML developers must carefully separate their data into training, evaluation, and test sets to avoid introducing Data Leakage into their code. Training data should be used to train the model, evaluation data should be used to repeatedly confirm a model's accuracy, and test data should be used only once to determine the accuracy of a production-ready model. In this paper, we develop LeakageDetector, a Python plugin for the PyCharm IDE that identifies instances of Data Leakage in ML code and provides suggestions on how to remove the leakage.
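To make the abstract's point concrete, here is a minimal, dependency-free toy example (not from the paper): a preprocessing statistic such as the mean must be computed from the training portion only, because computing it on the full dataset lets information about the held-out test point leak into training.

```python
# Toy dataset: the last value is the single held-out test point.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

mean_all = sum(data) / len(data)       # leaky: includes the test point
mean_train = sum(train) / len(train)   # correct: training data only

# Centering the training data with the leaky mean bakes knowledge of the
# test outlier into every training feature.
leaky_train = [x - mean_all for x in train]
clean_train = [x - mean_train for x in train]

print(mean_all, mean_train)  # 22.0 vs 2.5: the leaky mean is pulled toward the test point
```

The same principle applies to any fitted preprocessing step (scalers, encoders, feature selectors): fit on the training set, then apply to the evaluation and test sets.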
Problem

Research questions and friction points this paper is trying to address.

Identifies and fixes Data Leakage in ML pipelines
Prevents model overfitting by detecting bad coding practices
Ensures proper data separation for training, evaluation, and testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source Python plugin for PyCharm IDE
Detects Data Leakage in ML pipelines
Provides suggestions to fix leakage issues
👥 Authors
Eman Abdullah AlOmar, Stevens Institute of Technology, Hoboken, New Jersey, USA
Catherine DeMario, Stevens Institute of Technology, Hoboken, New Jersey, USA
Roger Shagawat, Stevens Institute of Technology, Hoboken, New Jersey, USA
Brandon Kreiser, Stevens Institute of Technology, Hoboken, New Jersey, USA