🤖 AI Summary
To address low accuracy and poor cross-dataset generalizability in code smell detection for large-scale software systems, this paper proposes a machine learning framework that integrates data balancing, feature selection, and hyperparameter optimization. Specifically, it employs SMOTE for minority-class oversampling, Pearson correlation-based feature filtering, and three hyperparameter tuning strategies (grid search, random search, and Bayesian optimization) to systematically evaluate eight mainstream classifiers, including XGBoost, AdaBoost, and Random Forest. Experimental results show that AdaBoost achieves 100% accuracy while XGBoost and Random Forest reach 99%; all three significantly outperform baseline methods on precision, recall, F1-score, and AUC. The paper presents this as the first systematic validation of combining these optimization strategies for code smell detection, offering a paradigm for accurate, generalizable, and reproducible automated software quality analysis.
📝 Abstract
This study addresses the challenge of detecting code smells in large-scale software systems using machine learning (ML). Traditional detection methods often suffer from low accuracy and poor generalization across different datasets. To overcome these issues, we propose a machine learning-based model that automatically and accurately identifies code smells, offering a scalable solution for software quality analysis. The novelty of our approach lies in the use of eight diverse ML algorithms, including XGBoost, AdaBoost, and other classifiers, alongside key techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) for class imbalance and Pearson correlation for efficient feature selection. These methods collectively improve model accuracy and generalization. Our methodology involves several steps: first, we preprocess the data and apply SMOTE to balance the dataset; next, we apply Pearson correlation-based feature selection to reduce redundancy; then we train the eight ML algorithms and tune their hyperparameters through Grid Search, Random Search, and Bayesian Optimization. Finally, we evaluate the models using accuracy, F-measure, and confusion matrices. The results show that AdaBoost, Random Forest, and XGBoost perform best, achieving accuracies of 100%, 99%, and 99%, respectively. This study provides a robust framework for detecting code smells, enhancing software quality assurance, and demonstrating the effectiveness of a comprehensive, optimized ML approach.
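The train-tune-evaluate loop described in the abstract can be sketched with scikit-learn. This is an illustrative reconstruction under stated assumptions, not the authors' code: it shows only two of the eight classifiers, uses a synthetic stand-in for the balanced code-metrics dataset, uses small made-up hyperparameter grids, and omits Bayesian Optimization (which would require an extra library such as Optuna or scikit-optimize).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

# Stand-in for the balanced, feature-selected code-metrics dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Grid Search: exhaustively try every combination in a small grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=5, scoring="f1",
)
grid.fit(X_tr, y_tr)

# Random Search: sample a fixed number of combinations instead.
rand = RandomizedSearchCV(
    AdaBoostClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "learning_rate": [0.5, 1.0]},
    n_iter=4, cv=5, scoring="f1", random_state=0,
)
rand.fit(X_tr, y_tr)

# Evaluate the tuned models with the paper's metrics:
# accuracy, F-measure, and a confusion matrix.
for name, model in [("RandomForest", grid.best_estimator_),
                    ("AdaBoost", rand.best_estimator_)]:
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred), f1_score(y_te, pred))
    print(confusion_matrix(y_te, pred))
```

The remaining six classifiers would slot into the same loop; only the estimator and its parameter grid change, which is what makes the eight-way comparison in the paper systematic.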