Knowledge-guided Contextual Gene Set Analysis Using Large Language Models

📅 2025-06-04
🤖 AI Summary
Traditional gene set analysis (GSA) neglects clinical context, yielding redundant, nonspecific, and poorly interpretable pathway enrichment results. To address this, we propose cGSA, a context-aware GSA framework that, for the first time, integrates fine-tuned large language models (LLMs) into the GSA pipeline. cGSA combines gene co-expression clustering with hypergeometric testing, then uses the LLM to re-rank the enriched pathways at the semantic level so that the ranking stays biologically grounded. Pathways are therefore assessed along two dimensions, clinical evidence and biological evidence, rather than on statistical significance alone. Evaluated on 102 gold-standard gene sets across 19 diseases, cGSA improves on baseline methods by more than 30% on average. Blind expert evaluation confirms higher pathway precision and mechanistic interpretability. Case studies in melanoma and breast cancer yield experimentally testable hypotheses about molecular mechanisms.
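The hypergeometric testing step mentioned above can be illustrated with a small over-representation test. This is a minimal sketch and not the cGSA implementation: the function names `hypergeom_tail` and `enrich`, the gene symbols, and the pathway names are all hypothetical, and the clustering, multiple-testing correction, and LLM re-ranking stages are left out.

```python
from math import comb


def hypergeom_tail(n_background, n_pathway, n_query, n_overlap):
    """One-sided p-value P(X >= n_overlap) for a hypergeometric draw.

    n_query genes are drawn without replacement from n_background genes,
    of which n_pathway belong to the pathway being tested.
    """
    upper = min(n_pathway, n_query)
    favourable = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_query)


def enrich(query_genes, pathways, background):
    """Return (pathway name, overlap size, p-value) tuples, smallest p-value first."""
    background = set(background)
    query = set(query_genes) & background  # ignore genes outside the background
    results = []
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(query & members)
        p = hypergeom_tail(len(background), len(members), len(query), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Toy data: gene symbols and pathway names are illustrative placeholders only.
background = [f"G{i}" for i in range(1, 21)]  # 20 background genes
pathways = {
    "pathway_A": ["G1", "G2", "G3", "G4"],
    "pathway_B": ["G5", "G6", "G7", "G8", "G9", "G10"],
}
query_cluster = ["G1", "G2", "G3", "G11"]  # one hypothetical gene cluster

ranked = enrich(query_cluster, pathways, background)
for name, overlap, p in ranked:
    print(f"{name}: overlap={overlap}, p={p:.4f}")
```

A ranking like this depends only on overlap counts, which is why a purely statistical list can surface pathways that are significant but irrelevant to the disease under study. That is the gap the context-aware re-ranking is meant to close.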

📝 Abstract
Gene set analysis (GSA) is a foundational approach for interpreting genomic data of diseases by linking genes to biological processes. However, conventional GSA methods overlook the clinical context of the analysis, often generating long lists of enriched pathways with redundant, nonspecific, or irrelevant results. Interpreting these results requires extensive, ad-hoc manual effort, reducing both reliability and reproducibility. To address this limitation, we introduce cGSA, a novel AI-driven framework that enhances GSA by incorporating context-aware pathway prioritization. cGSA integrates gene cluster detection, enrichment analysis, and large language models to identify pathways that are not only statistically significant but also biologically meaningful. Benchmarking on 102 manually curated gene sets across 19 diseases and ten disease-related biological mechanisms shows that cGSA outperforms baseline methods by over 30%, with expert validation confirming its increased precision and interpretability. Two independent case studies in melanoma and breast cancer further demonstrate its potential to uncover context-specific insights and support targeted hypothesis generation.
Problem

Research questions and friction points this paper is trying to address.

Conventional GSA lacks clinical context, producing redundant results
Manual interpretation of GSA results is unreliable and unreproducible
Need for context-aware pathway prioritization in genomic data analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates context-aware pathway prioritization
Integrates gene cluster detection and LLMs
Enhances precision and interpretability significantly
Zhizheng Wang
Postdoc, Division of Intramural Research (DIR), NLM, NIH
Large Language Models, Representation Learning, Graph Data Mining, Bioinformatics
Chi-Ping Day
Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute (NCI), National Institutes of Health (NIH); Bethesda, MD 20894, USA
Chih-Hsuan Wei
Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH); Bethesda, MD 20894, USA
Qiao Jin
Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH); Bethesda, MD 20894, USA
Robert Leaman
Staff Scientist, NCBI/NLM/NIH
Natural Language Processing, Machine Learning
Yifan Yang
Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH); Bethesda, MD 20894, USA
Shubo Tian
Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH); Bethesda, MD 20894, USA
Aodong Qiu
Department of Biomedical Informatics, School of Medicine, University of Pittsburgh; Pittsburgh, PA 15206, USA
Yin Fang
National Institutes of Health
AI4Bioinformatics, Knowledge Graph, Language Model
Qingqing Zhu
NIH
Xinghua Lu
Department of Biomedical Informatics, School of Medicine, University of Pittsburgh; Pittsburgh, PA 15206, USA
Zhiyong Lu
Senior Investigator, NLM; Adjunct Professor of CS, UIUC
BioNLP, Biomedical Informatics, Medical AI, Artificial Intelligence