How Data Quality Affects Machine Learning Models for Credit Risk Assessment

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates the impact of four prevalent data quality issues (missing values, attribute noise, outliers, and label noise) on the predictive performance of machine learning models in credit risk assessment. We propose a controlled, reproducible framework for evaluating data quality degradation, leveraging the Pucktrick library for standardized data corruption. Using open-source credit datasets, we conduct a comparative robustness analysis across ten mainstream models, including Random Forest, SVM, and Logistic Regression. Experimental results reveal substantial heterogeneity in model sensitivity to different data defects: certain models exhibit pronounced vulnerability to label noise, while others demonstrate relative resilience. Crucially, improving data quality yields an average AUC gain of 8.2%. This work provides a reusable methodological foundation and benchmark experimental framework for empirical research on data-centric AI in financial risk management.

📝 Abstract
Machine Learning (ML) models are increasingly employed for credit risk evaluation, and their effectiveness hinges largely on the quality of the input data. In this paper we investigate the impact of several data quality issues, including missing values, noisy attributes, outliers, and label errors, on the predictive accuracy of machine learning models used in credit risk assessment. Using an open-source dataset, we introduce controlled data corruption with the Pucktrick library to assess the robustness of ten frequently used models, including Random Forest, SVM, and Logistic Regression. Our experiments show significant differences in model robustness depending on the nature and severity of the data degradation. Moreover, the proposed methodology and accompanying tools offer practical support to practitioners seeking to enhance data pipeline robustness, and provide researchers with a flexible framework for further experimentation in data-centric AI contexts.
Problem

Research questions and friction points this paper is trying to address.

Investigating data quality impact on credit risk ML models
Assessing model robustness against various data corruption types
Providing a framework for data-centric AI experimentation in finance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled data corruption using Pucktrick library
Assessed robustness of 10 common ML models
Provided a framework for data-centric AI experimentation
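The corrupt-then-evaluate loop described above can be sketched in a few lines. This is a minimal illustration only: the paper uses the Pucktrick library on real open-source credit datasets, whereas this sketch uses a synthetic scikit-learn dataset, a hand-rolled label-noise injector, and assumed noise rates and model choices.

```python
# Minimal sketch of a controlled data-corruption robustness check.
# Illustrative stand-in for the paper's Pucktrick-based pipeline:
# inject label noise at increasing severity, retrain, and compare test AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def inject_label_noise(y, rate, rng):
    """Flip a `rate` fraction of binary labels to simulate label noise."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy

# Synthetic, class-imbalanced stand-in for a credit dataset.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    for rate in (0.0, 0.1, 0.3):          # corruption severity levels
        y_corrupt = inject_label_noise(y_tr, rate, rng)
        model.fit(X_tr, y_corrupt)        # train on corrupted labels
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(f"{name:20s} label-noise={rate:.1f}  test AUC={auc:.3f}")
```

The same loop extends to the other defect types studied (missing values, attribute noise, outliers) by swapping in a different injector, and the per-model AUC curves across severity levels give the kind of robustness comparison the paper reports.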