BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of automatically extracting protein-ligand bioactivity data from scientific literature, a task that requires interpreting biochemical semantics spread across text, tables, and figures and accurately reconstructing complex chemical structures, particularly Markush representations. The authors propose BioMiner, a multimodal framework that decouples bioactivity semantic parsing from ligand structure reconstruction. The former is handled through direct semantic reasoning; the latter uses a chemistry-aware visual semantic reasoning mechanism in which multimodal large language models infer relationships between structures and specialized cheminformatics tools perform the exact molecular construction. The work also introduces BioVista, a multimodal benchmark of 16,457 bioactivity entries curated from 500 publications, on which BioMiner achieves an F1 score of 0.32 for bioactivity triplets. Applied to 11,683 papers, BioMiner extracts 82,262 bioactivity records, yielding a pre-training database that improves downstream model performance by 3.9%. In a human-in-the-loop workflow, it doubles the amount of high-quality NLRP3 bioactivity data, which contributes to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds. Finally, BioMiner speeds up bioactivity annotation of protein-ligand complexes in the PoseBusters dataset by 5.59× and improves accuracy by 5.75% relative to manual workflows.

📝 Abstract

Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with the rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm: multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. On this benchmark, BioMiner demonstrates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner's practical utility is demonstrated via three applications: (1) extracting 82,262 bioactivity data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the number of high-quality NLRP3 bioactivity data points, contributing to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating bioactivity annotation of protein-ligand complexes, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
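
To make the triplet-level F1 metric mentioned in the abstract concrete, the following is a minimal Python sketch of how an F1 score over extracted triplets could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the triplet layout (protein, ligand, activity value), the exact-match criterion, the function name `triplet_f1`, and the example values are all hypothetical, and the actual BioVista matching rules may differ.

```python
from typing import Iterable, Tuple

# Hypothetical triplet layout: (protein, ligand identifier, activity value).
# The real benchmark may represent and match triplets differently.
Triplet = Tuple[str, str, str]


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """Return the F1 score of predicted triplets against gold triplets.

    A predicted triplet counts as correct only if it exactly matches a
    gold triplet (an assumption made for this sketch).
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        # Covers empty inputs and the case of no overlap.
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example data, for illustration only.
gold_triplets = [
    ("NLRP3", "compound_1", "IC50=12 nM"),
    ("NLRP3", "compound_2", "IC50=85 nM"),
    ("NLRP3", "compound_3", "IC50=3 uM"),
]
predicted_triplets = [
    ("NLRP3", "compound_1", "IC50=12 nM"),  # matches a gold triplet
    ("NLRP3", "compound_2", "IC50=58 nM"),  # wrong activity value
]

score = triplet_f1(predicted_triplets, gold_triplets)
print(f"F1 = {score:.2f}")
```

In this toy case, one of two predictions is correct (precision 0.5) and one of three gold triplets is recovered (recall 1/3), so a single wrong element anywhere in a triplet lowers both precision and recall. This strictness is one reason triplet-level scores can be low even when individual fields are often extracted correctly.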

Problem

Research questions and friction points this paper is trying to address.

protein-ligand bioactivity

automated data extraction

multi-modal mining

chemical structure reconstruction

literature curation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal extraction

chemical structure reconstruction

visual semantic reasoning