🤖 AI Summary
Traditional gene set analysis (GSA) neglects clinical context, yielding redundant, nonspecific, and poorly interpretable pathway enrichment results. To address this, we propose cGSA, a context-aware GSA framework that integrates fine-tuned large language models (LLMs) into the GSA pipeline for the first time. cGSA combines gene co-expression clustering with hypergeometric testing, then uses the LLM to re-rank the enriched pathways by their semantic relevance to the clinical context. Rather than relying on statistical significance alone, it assesses interpretability along two dimensions: clinical evidence and biological evidence. Evaluated on 102 gold-standard gene sets spanning 19 diseases, cGSA achieves an average improvement of more than 30% over baseline methods. Blinded expert evaluation confirms higher pathway precision and mechanistic interpretability. Case studies in melanoma and breast cancer generate experimentally testable hypotheses about molecular mechanisms.
📝 Abstract
Gene set analysis (GSA) is a foundational approach for interpreting disease genomic data by linking genes to biological processes. However, conventional GSA methods overlook the clinical context of the analysis and often generate long lists of enriched pathways that include redundant, nonspecific, or irrelevant results. Interpreting these lists requires extensive, ad hoc manual effort, which reduces both reliability and reproducibility. To address this limitation, we introduce cGSA, a novel AI-driven framework that enhances GSA by incorporating context-aware pathway prioritization. cGSA integrates gene cluster detection, enrichment analysis, and large language models to identify pathways that are not only statistically significant but also biologically meaningful. Benchmarking on 102 manually curated gene sets across 19 diseases and ten disease-related biological mechanisms shows that cGSA outperforms baseline methods by over 30%, with expert validation confirming its increased precision and interpretability. Two independent case studies in melanoma and breast cancer further demonstrate its potential to uncover context-specific insights and support targeted hypothesis generation.
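To make the enrichment step concrete, the following is a minimal sketch of an over-representation test based on the hypergeometric distribution, which the summary above names as the statistical component. It is not cGSA's implementation: the function names, the toy gene identifiers, the toy pathways, and the background size of 100 are all assumptions made for illustration, and the clustering and LLM re-ranking stages are not shown.

```python
from math import comb


def enrichment_p_value(overlap, pathway_size, query_size, background_size):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of pathway genes found in a random draw of
    `query_size` genes from a background of `background_size` genes,
    of which `pathway_size` belong to the pathway.
    """
    upper = min(pathway_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def rank_pathways(query_genes, pathways, background_size):
    """Return (pathway, overlap, p_value) tuples sorted by ascending p-value."""
    query = set(query_genes)
    results = []
    for name, members in pathways.items():
        members = set(members)
        overlap = len(query & members)
        p_value = enrichment_p_value(
            overlap, len(members), len(query), background_size
        )
        results.append((name, overlap, p_value))
    return sorted(results, key=lambda row: row[2])


# Hypothetical toy data: placeholder gene IDs, not real annotations.
TOY_PATHWAYS = {
    "toy_pathway_a": {"G1", "G2", "G3", "G4"},
    "toy_pathway_b": {"G5", "G6", "G7", "G8"},
}
TOY_QUERY = {"G1", "G2", "G3", "G5"}

if __name__ == "__main__":
    for name, overlap, p_value in rank_pathways(TOY_QUERY, TOY_PATHWAYS, 100):
        print(f"{name}\toverlap={overlap}\tp={p_value:.3g}")
```

A ranking like this reflects statistical over-representation only. The context-aware prioritization described in the abstract would operate on top of such a list, and a real analysis would also correct the p-values for multiple testing.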