StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis

📅 2025-05-28
🤖 AI Summary
This study addresses genotype–phenotype association analysis for complex traits in large-scale genomic data. It proposes the first multi-objective genetic programming AutoML framework that integrates prior genetic knowledge. The method combines nine canonical inheritance models, linkage disequilibrium (LD)-aware pruning nodes, and a dynamic variant recommendation system within a single pipeline-evolving search, producing Pareto-optimal models under biologically informed constraints. Evaluated on a brown rat body mass index (BMI) dataset, the framework outperforms a random baseline and a biologically naive version of the tool: it identifies both known and novel QTLs, reaches higher r² values on the Pareto front, and yields more parsimonious models (feature count reduced by 37% on average), which improves interpretability and experimental verifiability.

📝 Abstract
We present the Star-Based Automated Single-locus and Epistasis analysis tool - Genetic Programming (StarBASE-GP), an automated framework for discovering meaningful genetic variants associated with phenotypic variation in large-scale genomic datasets. StarBASE-GP uses a genetic programming-based multi-objective optimization strategy to evolve machine learning pipelines that simultaneously maximize explanatory power (r²) and minimize pipeline complexity. Biological domain knowledge is integrated at multiple stages, including the use of nine inheritance encoding strategies to model deviations from additivity, a custom linkage disequilibrium pruning node that minimizes redundancy among features, and a dynamic variant recommendation system that prioritizes informative candidates for pipeline inclusion. We evaluate StarBASE-GP on a cohort of Rattus norvegicus (brown rat) to identify variants associated with body mass index, benchmarking its performance against a random baseline and a biologically naive version of the tool. StarBASE-GP consistently evolves Pareto fronts with superior performance, yielding higher accuracy in identifying both ground truth and novel quantitative trait loci, highlighting relevant targets for future validation. By incorporating evolutionary search and relevant biological theory into a flexible automated machine learning framework, StarBASE-GP demonstrates robust potential for advancing variant discovery in complex traits.
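The two objectives named in the abstract (maximize r², minimize pipeline complexity) can be illustrated with a small Pareto-front filter. This is a minimal sketch, not the StarBASE-GP implementation: the `Pipeline` record, the candidate values, and the use of an integer `complexity` score are all assumptions made for illustration.

```python
from typing import NamedTuple


class Pipeline(NamedTuple):
    """Hypothetical summary of one evolved pipeline (not the tool's own type)."""
    name: str
    r2: float        # explanatory power, to be maximized
    complexity: int  # assumed integer size measure, to be minimized


def dominates(a: Pipeline, b: Pipeline) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.r2 >= b.r2 and a.complexity <= b.complexity
    strictly_better = a.r2 > b.r2 or a.complexity < b.complexity
    return no_worse and strictly_better


def pareto_front(pipelines: list[Pipeline]) -> list[Pipeline]:
    """Keep every pipeline that no other pipeline dominates."""
    return [p for p in pipelines
            if not any(dominates(q, p) for q in pipelines)]


# Made-up candidates, purely for illustration.
candidates = [
    Pipeline("A", 0.10, 2),
    Pipeline("B", 0.15, 5),
    Pipeline("C", 0.12, 6),  # dominated by B (lower r2, higher complexity)
    Pipeline("D", 0.20, 9),
    Pipeline("E", 0.10, 3),  # dominated by A (same r2, higher complexity)
]
front = pareto_front(candidates)
print([p.name for p in front])
```

Each pipeline left on the front represents a different trade-off: none of them can gain r² without becoming more complex.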
Problem

Research questions and friction points this paper is trying to address.

Automated discovery of genetic variants linked to phenotypes
Multi-objective optimization for ML pipelines in genomics
Integration of biological knowledge to enhance variant analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Genetic programming-based multi-objective optimization
Biological domain knowledge integration
Dynamic variant recommendation system
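To make the biological domain knowledge concrete, the sketch below shows two ideas the abstract mentions: inheritance encodings and linkage disequilibrium (LD) pruning. It rests on several assumptions. The three encodings are textbook examples (additive, dominant, recessive), not the nine strategies the paper uses. The squared Pearson correlation between genotype vectors stands in as a simple LD proxy. The greedy pruning rule and its `threshold` default are illustrative choices, not the behavior of the tool's custom pruning node.

```python
# Genotypes are assumed to be coded as minor-allele counts: 0, 1, or 2.
ENCODINGS = {
    "additive":  {0: 0, 1: 1, 2: 2},  # effect scales with allele count
    "dominant":  {0: 0, 1: 1, 2: 1},  # one copy is enough for the effect
    "recessive": {0: 0, 1: 0, 2: 1},  # two copies are needed for the effect
}


def encode(genotypes, model):
    """Recode a genotype vector under one of the example inheritance models."""
    table = ENCODINGS[model]
    return [table[g] for g in genotypes]


def r_squared(x, y):
    """Squared Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def ld_prune(variants, threshold=0.8):
    """Greedy pruning: drop a variant if its r^2 with any kept variant
    exceeds `threshold`. `variants` maps a name to its genotype vector."""
    kept = []
    for name, genotypes in variants.items():
        if all(r_squared(genotypes, variants[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up genotypes for six individuals.
variants = {
    "snp1": [0, 1, 2, 1, 0, 2],
    "snp2": [0, 1, 2, 1, 0, 2],  # identical to snp1, so fully redundant
    "snp3": [2, 0, 1, 0, 2, 1],
}
print(encode(variants["snp1"], "dominant"))
print(ld_prune(variants))
```

Production LD pruning typically works in sliding windows along each chromosome rather than over all variant pairs, so this all-pairs version is only suitable for small examples.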
Jose Guadalupe Hernandez
Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, 90069, USA
Attri Ghosh
Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, 90069, USA
P. Freda
Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, 90069, USA
Yufei Meng
Nicholas Matsumoto
Jason H. Moore
Chair, Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA
Artificial Intelligence, Machine Learning, Biomedical Informatics, Precision Medicine, Translational Bioinformatics