🤖 AI Summary
This study addresses the clinical challenge of insufficient accuracy in pathological staging of prostate cancer (PCa), which impairs treatment decision-making. Using RNA-seq data from 486 PCa patients in The Cancer Genome Atlas (TCGA), we systematically developed and evaluated multiple machine learning and deep learning models. Our methodology combines feature engineering, including principal component analysis (PCA) for dimensionality reduction, with data augmentation strategies. Notably, this is the first comprehensive performance comparison of Random Forest, XGBoost, support vector machine (SVM), logistic regression, and deep neural networks specifically for PCa staging. Results show that Random Forest achieves the highest test F1-score (approximately 83%), while data-augmented deep learning models attain 71.23% accuracy, significantly outperforming conventional clinical staging criteria. This work establishes a reproducible, genomics-driven framework for precise pathological staging and provides an empirical benchmark for future studies.
📝 Abstract
Prostate cancer (PCa) remains a leading cause of cancer-related mortality in men. Traditional diagnostic methods such as the Digital Rectal Exam (DRE), Prostate-Specific Antigen (PSA) testing, and biopsy have limited precision, which makes accurate staging critical for improving treatment outcomes and patient prognosis. This study applies machine learning and deep learning approaches, together with feature selection and extraction methods, to improve prediction of PCa pathological stage from RNA sequencing data in The Cancer Genome Atlas (TCGA). Gene expression profiles from 486 tumors were analyzed using Random Forest (RF), Logistic Regression (LR), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM). Model performance is reported as F1-score, precision, and recall, each calculated as a weighted average across classes. The highest test F1-score, approximately 83%, was achieved by RF, followed by LR at 80%, while XGB and SVM both scored around 79%. In addition, deep learning models with data augmentation achieved an accuracy of 71.23%, and deep learning models with PCA-based dimensionality reduction reached an accuracy of 69.86%. This research highlights the potential of AI-driven approaches in clinical oncology, paving the way for more reliable diagnostic tools that can ultimately improve patient outcomes.
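The abstract reports precision, recall, and F1-score as weighted averages. As a minimal sketch of what that usually means (per-class scores averaged with weights proportional to each class's share of the true labels), the function below computes all three in plain Python. The function name `weighted_scores` and the stage labels in the usage example are hypothetical illustrations, not the study's code or data.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return support-weighted (precision, recall, F1) over the classes in y_true.

    Each class's score is weighted by its share of the true labels, so
    frequent stages contribute more than rare ones.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)
    precision = recall = f1 = 0.0

    for label, count in support.items():
        # True positives: samples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)

        p_c = tp / predicted if predicted else 0.0  # per-class precision
        r_c = tp / count                            # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0

        weight = count / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return precision, recall, f1


# Hypothetical stage labels, for illustration only.
y_true = ["T2", "T2", "T2", "T3", "T3", "T4"]
y_pred = ["T2", "T2", "T3", "T3", "T3", "T3"]
p, r, f = weighted_scores(y_true, y_pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

With this weighting, recall reduces to overall accuracy, and a class that is never predicted (here the hypothetical "T4") contributes zero to all three scores. Both points matter when comparing the F1-scores and accuracies quoted above on imbalanced stage distributions.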