🤖 AI Summary
Early detection of hard-to-diagnose cancers (e.g., pancreatic cancer) via liquid biopsy faces challenges from high-dimensional, small-sample, and severely class-imbalanced data, leading to poor classification robustness and prohibitively expensive hyperparameter optimization. Method: We use a meta-trained Hyperfast model for binary classification and propose a novel ensemble that combines a pretrained Hyperfast model with XGBoost and LightGBM for multiclass classification, coupled with PCA-based dimensionality reduction (retaining only 500 features) to reduce dependence on high-dimensional inputs and avoid exhaustive hyperparameter search. Contribution/Results: Hyperfast reaches an AUC of up to 0.9929 in binary classification and stays robust on highly imbalanced datasets compared with other ML algorithms such as SVM and Random Forest, while the ensemble achieves an accuracy of 0.9464 in multiclass classification, an incremental improvement obtained with 500 PCA features where previous studies used more than 2,000. To our knowledge, this is the first application of Hyperfast to biomarker classification, offering an efficient, plug-and-play solution for low-resource, high-noise clinical datasets.
📝 Abstract
Certain cancer types, notably pancreatic cancer, are difficult to detect at an early stage, which makes it important to uncover the causal relationship between biomarkers and cancer so that cancer can be identified efficiently. Liquid biopsies allow specific biomarkers to be detected and monitored non-invasively, which improves the precision and efficacy of medical interventions and supports the move toward personalized healthcare. Machine learning algorithms such as Random Forest and SVM are commonly used for classification, but they are inefficient because they require hyperparameter tuning. We leverage a meta-trained Hyperfast model for cancer classification, reaching a highest AUC of 0.9929 and showing greater robustness than other ML algorithms, especially on highly imbalanced datasets, across several binary classification tasks (e.g., breast invasive carcinoma: BRCA vs. non-BRCA). We also propose a novel ensemble model that combines a pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while using only 500 PCA features, whereas previous studies used more than 2,000 features for similar results.
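The abstract does not state how the three models' outputs are combined, so the following is only a minimal sketch under stated assumptions: it assumes soft voting (averaging class probabilities) and uses a plain SVD-based PCA. The helper names `fit_pca`, `project`, and `soft_vote`, the synthetic data, and the probability values are all hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_pca(X_train, n_components):
    """Return (mean, components) estimated from the training rows only."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Map samples into the reduced PCA space."""
    return (np.asarray(X, dtype=float) - mean) @ components.T


def soft_vote(prob_matrices, weights=None):
    """Average class probabilities across models, then take the argmax.

    Assumption: soft voting; the paper may combine the models differently.
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_matrices])
    avg = np.average(stacked, axis=0, weights=weights)
    return avg.argmax(axis=1), avg


# Synthetic stand-in for a high-dimensional, small-sample feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
mean, components = fit_pca(X, n_components=10)
Z = project(X, mean, components)  # 60 samples, 10 PCA features

# Hypothetical class probabilities from three models (2 samples, 3 classes).
p_hyperfast = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
p_xgboost = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
p_lightgbm = [[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]
labels, avg = soft_vote([p_hyperfast, p_xgboost, p_lightgbm])
print(Z.shape, labels)
```

In a real pipeline the probability matrices would come from each fitted classifier, and the PCA mean and components would be estimated on the training split and reused unchanged on held-out data. Note that PCA can return at most min(n_samples, n_features) components, so retaining 500 features presupposes at least that many training samples.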