Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge

📅 2023-07-17

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 0

🤖 AI Summary

Traditional biomedical knowledge extraction relies heavily on manual curation, resulting in low scalability and efficiency. Method: This study conducts the first systematic evaluation of large language models (LLMs) for genome-scale molecular interaction and pathway knowledge extraction. We integrate BioBERT, LLaMA, and ChatGLM with prompt engineering and supervised fine-tuning using gold-standard databases (including STRING, KEGG, and Reactome), alongside zero-shot inference. Results: Large models significantly outperform smaller ones, achieving moderate performance (F1 ≈ 0.62) on protein–protein interaction identification, radiation-response pathway gene discovery, and gene regulatory relationship parsing. However, critical bottlenecks persist in identifying functionally heterogeneous gene clusters and modeling strongly correlated regulatory relationships. This work provides empirical evidence and methodological guidance for AI-driven, scalable, and automated biological knowledge discovery.
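
To make the F1 figure above concrete, here is a minimal sketch of how predicted protein–protein interactions could be scored against a gold-standard set. It is an illustration under stated assumptions, not the paper's evaluation code: the function names (`normalize_pairs`, `precision_recall_f1`) are hypothetical, interactions are assumed to be unordered and case-insensitive, and the gene pairs are toy data rather than results from the study.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def normalize_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Treat each interaction as an unordered, case-insensitive pair (assumption)."""
    normalized = set()
    for a, b in pairs:
        a, b = a.strip().upper(), b.strip().upper()
        if a != b:  # drop self-interactions in this toy setup
            normalized.add(frozenset((a, b)))
    return normalized


def precision_recall_f1(
    predicted: Iterable[Tuple[str, str]],
    gold: Iterable[Tuple[str, str]],
) -> Tuple[float, float, float]:
    """Score predicted interaction pairs against a gold-standard set."""
    pred = normalize_pairs(predicted)
    ref = normalize_pairs(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data for illustration only; not taken from the paper or any database export.
gold_pairs = [("TP53", "MDM2"), ("BRCA1", "BARD1"), ("ATM", "CHEK2")]
model_pairs = [("mdm2", "tp53"), ("BRCA1", "BARD1"), ("TP53", "EGFR")]

precision, recall, f1 = precision_recall_f1(model_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case two of the three predicted pairs match the gold set, so precision, recall, and F1 all equal 2/3. F1 is the harmonic mean of precision and recall, so a model that lists many spurious partners and one that misses many true partners are both penalized.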

📝 Abstract

Background: Identification of the interactions and regulatory relations between biomolecules plays a pivotal role in understanding complex biological systems and the mechanisms underlying diverse biological functions. However, the collection of such molecular interactions has relied heavily on expert curation in the past, making it labor-intensive and time-consuming. To mitigate these challenges, we propose leveraging the capabilities of large language models (LLMs) to automate genome-scale extraction of this crucial knowledge. Results: In this study, we investigate the efficacy of various LLMs in addressing biological tasks, such as the recognition of protein interactions, identification of genes linked to pathways affected by low-dose radiation, and the delineation of gene regulatory relationships. Overall, the larger models exhibited superior performance, indicating their potential for specific tasks that involve the extraction of complex interactions among genes and proteins. Although these models possessed detailed information for distinct gene and protein groups, they faced challenges in identifying groups with diverse functions and in recognizing highly correlated gene regulatory relationships. Conclusions: By conducting a comprehensive assessment of state-of-the-art models using well-established molecular interaction and pathway databases, our study reveals that LLMs can identify genes/proteins associated with pathways of interest and predict their interactions to a certain extent. Furthermore, these models can provide important insights, marking a noteworthy stride toward advancing our understanding of biological systems through AI-assisted knowledge discovery.

Problem

Research questions and friction points this paper is trying to address.

Automate genome-scale extraction of molecular interactions using LLMs

Evaluate LLM performance in recognizing protein interactions and gene pathways

Assess LLM capabilities in predicting gene regulatory relationships

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging large language models for genome-scale extraction

Evaluating LLMs on protein interaction recognition

Assessing gene regulatory relationship prediction accuracy
