🤖 AI Summary
This study addresses the scarcity of structured, context-rich experimental data in targeted protein degradation (TPD), which has hindered the development of computational models. The authors propose an expert-in-the-loop large language model (LLM) workflow tailored to TPD. It combines lightweight cross-validated prompt refinement, transfer between degrader classes by terminology substitution, and a triangular comparison among LLM predictions, standardized baseline records, and expert annotations. The workflow extracts compounds, targets, recruiters, and key experimental conditions from the scientific literature. With only seven annotated molecular glue publications, it reaches record-level $F_1 = 0.98$ and maintains record-level $F_1 > 0.93$ when transferred to PROTACs. Applied at scale, it expands the molecular glue and PROTAC databases by 81% and 92%, respectively, and expert review confirms 92% and 82.5% of the newly recovered records as correct. The recovered kinetic and assay-context information supports condition-aware modeling of degrader activity.
📝 Abstract
Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert-annotated ground truth. A lightweight cross-validated prompt-refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record-level $F_1 = 0.98$ and transferred to PROTACs by terminology substitution alone, maintaining record-level $F_1 > 0.93$. Applied at scale, it expanded the molecular glue and PROTAC databases by 81% and 92% in record count, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay-context information essential for cross-study potency comparison and condition-aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.
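The abstract does not spell out how record-level $F_1$ is computed. One common reading, assumed here, is exact matching of whole extracted records against expert-annotated records, followed by the usual precision and recall calculation. The sketch below illustrates that reading only; the `AssayRecord` fields, the example values, and the function name are hypothetical and are not taken from the released evaluation code.

```python
from typing import Iterable, NamedTuple


class AssayRecord(NamedTuple):
    # Hypothetical record schema; the field names are illustrative assumptions.
    compound: str
    target: str
    recruiter: str
    endpoint: str


def record_level_scores(
    predicted: Iterable[AssayRecord], gold: Iterable[AssayRecord]
) -> dict:
    """Precision, recall, and F1 under exact whole-record matching (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: three expert-annotated records, three extracted records,
# two of which match exactly.
gold = [
    AssayRecord("cmpd-1", "TARGET-A", "CRBN", "DC50=10nM"),
    AssayRecord("cmpd-2", "TARGET-A", "CRBN", "DC50=50nM"),
    AssayRecord("cmpd-3", "TARGET-B", "VHL", "Dmax=90%"),
]
predicted = [
    AssayRecord("cmpd-1", "TARGET-A", "CRBN", "DC50=10nM"),
    AssayRecord("cmpd-2", "TARGET-A", "CRBN", "DC50=50nM"),
    AssayRecord("cmpd-3", "TARGET-B", "CRBN", "Dmax=90%"),  # wrong recruiter
]
scores = record_level_scores(predicted, gold)
```

Under exact matching, a single wrong field (here the recruiter) makes the whole record count as both a false positive and a false negative, which is why record-level scores are stricter than field-level ones.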