🤖 AI Summary
Traditional biomedical knowledge extraction relies heavily on manual curation, which limits scalability and efficiency. Method: This study conducts the first systematic evaluation of large language models (LLMs) for genome-scale molecular interaction and pathway knowledge extraction. We integrate BioBERT, LLaMA, and ChatGLM with prompt engineering and supervised fine-tuning using gold-standard databases (including STRING, KEGG, and Reactome), alongside zero-shot inference. Results: Large models significantly outperform smaller ones, achieving moderate performance (F1 ≈ 0.62) on protein–protein interaction identification, radiation-response pathway gene discovery, and gene regulatory relationship parsing. However, critical bottlenecks persist in identifying functionally heterogeneous gene clusters and modeling strongly correlated regulatory relationships. This work provides empirical evidence and methodological guidance for AI-driven, scalable, and automated biological knowledge discovery.
📝 Abstract
Background: Identifying the interactions and regulatory relations between biomolecules plays a pivotal role in understanding complex biological systems and the mechanisms underlying diverse biological functions. However, the collection of such molecular interactions has relied heavily on expert curation, making it labor-intensive and time-consuming. To mitigate these challenges, we propose leveraging the capabilities of large language models (LLMs) to automate genome-scale extraction of this crucial knowledge. Results: In this study, we investigate the efficacy of various LLMs in addressing biological tasks, such as the recognition of protein interactions, identification of genes linked to pathways affected by low-dose radiation, and the delineation of gene regulatory relationships. Overall, the larger models exhibited superior performance, indicating their potential for specific tasks that involve the extraction of complex interactions among genes and proteins. Although these models possessed detailed information for distinct gene and protein groups, they faced challenges in identifying groups with diverse functions and in recognizing highly correlated gene regulatory relationships. Conclusions: By conducting a comprehensive assessment of state-of-the-art models using well-established molecular interaction and pathway databases, our study reveals that LLMs can identify genes/proteins associated with pathways of interest and predict their interactions to a certain extent. Furthermore, these models can provide important insights, marking a noteworthy stride toward advancing our understanding of biological systems through AI-assisted knowledge discovery.
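
As an illustration of how model output might be scored against a curated database, the following is a minimal sketch, not the study's actual evaluation code. It assumes interactions are represented as unordered pairs of gene symbols and computes precision, recall, and F1 against a gold-standard set. The function name `score_pairs` and the example pairs are hypothetical and chosen only for demonstration.

```python
def score_pairs(predicted, gold):
    """Return (precision, recall, F1) for unordered interaction pairs.

    Assumption: each interaction is a 2-tuple of gene symbols, and
    ("TP53", "MDM2") is the same interaction as ("MDM2", "TP53").
    """
    # Normalize case and ignore pair order by storing each pair as a frozenset.
    pred = {frozenset(s.upper() for s in pair) for pair in predicted}
    ref = {frozenset(s.upper() for s in pair) for pair in gold}

    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy pairs for illustration only; not taken from the study or from STRING.
gold_pairs = [("TP53", "MDM2"), ("BRCA1", "BARD1"), ("ATM", "CHEK2")]
llm_pairs = [("MDM2", "TP53"), ("BRCA1", "BARD1"), ("TP53", "EGFR")]

precision, recall, f1 = score_pairs(llm_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# prints: precision=0.67 recall=0.67 F1=0.67
```

Treating pairs as unordered sets is a design choice suited to undirected protein–protein interactions. Directed gene regulatory relationships would need ordered tuples instead.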