Knowledge-guided Contextual Gene Set Analysis Using Large Language Models

📅 2025-06-04

🤖 AI Summary

Traditional gene set analysis (GSA) neglects clinical context, yielding redundant, nonspecific, and poorly interpretable pathway enrichment results. To address this, the authors propose cGSA, a context-aware GSA framework that integrates fine-tuned large language models (LLMs) into the GSA pipeline for the first time. cGSA combines gene co-expression clustering with hypergeometric testing, then uses the LLM to re-rank the enriched pathways at the semantic level so that the ranking reflects biological relevance to the clinical context. Rather than relying on statistical significance alone, it assesses interpretability along two dimensions: clinical evidence and biological evidence. Evaluated on 102 gold-standard gene sets across 19 diseases, cGSA achieves an average improvement of more than 30% over baseline methods. Blind evaluation by experts confirms higher pathway precision and better mechanistic interpretability. Case studies in melanoma and breast cancer generate experimentally testable hypotheses about molecular mechanisms.
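The hypergeometric test mentioned in the summary is the standard over-representation test used in enrichment analysis. The following is a minimal sketch of that step only, not the authors' implementation: the function names `hypergeom_enrichment_pvalue` and `rank_pathways`, the toy gene identifiers, and the two toy pathways are all hypothetical, and the LLM-based re-ranking is not shown.

```python
from math import comb


def hypergeom_enrichment_pvalue(universe_size, pathway_size, query_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    X counts pathway genes in a query of `query_size` genes drawn without
    replacement from `universe_size` genes, of which `pathway_size` are
    annotated to the pathway.
    """
    max_overlap = min(pathway_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_pathways(query_genes, pathways, universe):
    """Return (pathway name, p-value) pairs sorted by ascending p-value.

    Hypothetical helper: genes outside `universe` are ignored, and no
    multiple-testing correction is applied in this sketch.
    """
    universe = set(universe)
    query = set(query_genes) & universe
    results = []
    for name, genes in pathways.items():
        members = set(genes) & universe
        overlap = len(query & members)
        p = hypergeom_enrichment_pvalue(len(universe), len(members), len(query), overlap)
        results.append((name, p))
    return sorted(results, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy data (assumed for illustration only).
    universe = [f"g{i}" for i in range(20)]
    pathways = {
        "A": ["g0", "g1", "g2", "g3", "g4"],
        "B": ["g10", "g11", "g12", "g13", "g14"],
    }
    query = ["g0", "g1", "g2", "g10"]
    for name, p in rank_pathways(query, pathways, universe):
        print(f"{name}: p = {p:.4f}")
```

In a real analysis the p-values would normally be corrected for multiple testing before any ranking. A context-aware step such as the one the summary describes would then reorder or filter this statistically ranked list.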

📝 Abstract

Gene set analysis (GSA) is a foundational approach for interpreting genomic data of diseases by linking genes to biological processes. However, conventional GSA methods overlook the clinical context of the analysis, often generating long lists of enriched pathways that contain redundant, nonspecific, or irrelevant results. Interpreting these lists requires extensive, ad hoc manual effort, which reduces both reliability and reproducibility. To address this limitation, we introduce cGSA, a novel AI-driven framework that enhances GSA by incorporating context-aware pathway prioritization. cGSA integrates gene cluster detection, enrichment analysis, and large language models to identify pathways that are not only statistically significant but also biologically meaningful. Benchmarking on 102 manually curated gene sets across 19 diseases and ten disease-related biological mechanisms shows that cGSA outperforms baseline methods by over 30%, with expert validation confirming its increased precision and interpretability. Two independent case studies in melanoma and breast cancer further demonstrate its potential to uncover context-specific insights and support targeted hypothesis generation.

Problem

Research questions and friction points this paper is trying to address.

Conventional GSA lacks clinical context, producing redundant results

Manual interpretation of GSA results is unreliable and unreproducible

Need for context-aware pathway prioritization in genomic data analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates context-aware pathway prioritization

Integrates gene cluster detection and LLMs

Enhances precision and interpretability significantly
