🤖 AI Summary
Biomedical relation classification (RC) is semantically complex, and the resulting prediction errors hinder knowledge graph construction and drug repurposing. To address this, we propose an error-aware teacher-student framework in which GPT-4o, as the teacher, analyzes a baseline student model's prediction failures: it classifies error types, assigns difficulty scores, and generates remediations in the form of sentence rewrites and suggestions for knowledge-graph-based enrichment. To our knowledge, this is the first work to integrate fine-grained error-type analysis with difficulty-aware curriculum learning. We further construct a heterogeneous biomedical knowledge graph from PubMed abstracts to strengthen contextual modeling. The method combines instruction tuning, knowledge-graph-guided classification, and sentence-level data augmentation. Evaluated on five protein–protein interaction (PPI) datasets, one drug–drug interaction (DDI) dataset, and ChemProt, our approach achieves state-of-the-art performance on four of the five PPI benchmarks and on the DDI dataset, while remaining competitive on ChemProt. These results suggest improved robustness and generalization across biomedical relation classification tasks.
📝 Abstract
Relation Classification (RC) in biomedical texts is essential for constructing knowledge graphs and enabling applications such as drug repurposing and clinical decision-making. We propose an error-aware teacher-student framework that improves RC through structured guidance from a large language model (GPT-4o). The teacher analyzes prediction failures from a baseline student model to classify error types, assign difficulty scores, and generate targeted remediations, including sentence rewrites and suggestions for knowledge graph (KG) based enrichment. These enriched annotations are used to train a first student model via instruction tuning. This model then annotates a broader dataset with difficulty scores and remediation-enhanced inputs. A second student is subsequently trained on this dataset via curriculum learning, with examples ordered by difficulty, to promote robust and progressive learning. We also construct a heterogeneous biomedical knowledge graph from PubMed abstracts to support context-aware RC. Our approach achieves new state-of-the-art performance on 4 of 5 PPI datasets and on the DDI dataset, while remaining competitive on ChemProt.
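The difficulty-ordered curriculum used for the second student can be illustrated with a minimal Python sketch. The abstract does not specify the schedule, so everything here is an assumption for illustration: the `Example` record, the `curriculum_stages` helper, the convention that a higher score means a harder example, and the choice of cumulative stages (each stage keeps the easier examples and adds the next-hardest slice). It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Example:
    """Hypothetical training record; field names are illustrative only."""
    text: str          # input sentence, possibly a remediation-enhanced rewrite
    label: str         # relation label
    difficulty: float  # assumed convention: higher score = harder example


def curriculum_stages(examples: List[Example], n_stages: int) -> Iterator[List[Example]]:
    """Yield cumulative training pools, easiest examples first."""
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    # Sort once by difficulty so every stage is a prefix of the same ordering.
    ordered = sorted(examples, key=lambda ex: ex.difficulty)
    for stage in range(1, n_stages + 1):
        # Ceiling division, so the final stage always covers the full dataset.
        cutoff = (len(ordered) * stage + n_stages - 1) // n_stages
        yield ordered[:cutoff]
```

A training loop would iterate over the stages in order and fine-tune on each pool in turn. Cumulative pools are one common design choice because they keep easy examples in view while harder ones are introduced; disjoint difficulty buckets would be an equally plausible reading of "ordered by difficulty".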