🤖 AI Summary
To address the scarcity of annotated data for biomedical entity linking in low-resource settings, as well as the high deployment cost and instability of closed-source large language models (LLMs), this paper proposes RPDR, a knowledge distillation framework that uses a closed-source LLM (e.g., GPT-4) to generate high-quality training data from unannotated text. The generated data is used to fine-tune a lightweight open-source LLM (e.g., Qwen, Llama) to re-rank candidates produced by a retriever fine-tuned on a small amount of labeled data, enabling fully local inference. This "closed-source-assisted open-source" distillation paradigm delivers high-performance, low-cost, locally deployable entity linking without extensive manual annotation. Evaluated on the Chinese Aier dataset and the English Ask A Patient dataset, RPDR improves Acc@1 by 0.019 and 0.036, respectively, over supervised baselines in low-data settings, demonstrating strong generalization and practical utility.
📝 Abstract
Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their use in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address this but bring stability risks and high economic costs: access is controlled by commercial providers, and processing large amounts of data is expensive. To address this, we propose "RPDR", a framework that combines closed-source and open-source LLMs to re-rank candidates retrieved by a retriever fine-tuned on a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM on it for re-ranking, we distill the closed-source model's knowledge into an open-source LLM that can be deployed locally, avoiding both the stability issues and the high economic costs. We evaluate RPDR on two datasets, one real-world and one publicly available, covering two languages: Chinese and English. When training data is insufficient, RPDR achieves Acc@1 improvements of 0.019 on the Aier dataset and 0.036 on the Ask A Patient dataset. These results demonstrate the effectiveness and generalizability of the proposed framework.
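At a high level, the pipeline first retrieves candidate standard entities for a nonstandard mention, then re-ranks them with the distilled model. A minimal sketch of that retrieve-then-rerank flow, using toy string-similarity stand-ins for both the fine-tuned retriever and the distilled open-source LLM re-ranker (all function names, the tiny knowledge base, and the similarity heuristic are illustrative assumptions, not the paper's implementation):

```python
# Illustrative retrieve-then-rerank sketch; not the actual RPDR implementation.
from difflib import SequenceMatcher

# Toy knowledge base of standard entities (hypothetical example).
KB = ["myocardial infarction", "migraine", "hypertension"]

def similarity(a: str, b: str) -> float:
    """Surface-form similarity; a stand-in for a learned scoring model."""
    return SequenceMatcher(None, a, b).ratio()

def retrieve_candidates(mention: str, kb: list[str], k: int = 2) -> list[str]:
    """Stand-in for the retriever fine-tuned on a small amount of data:
    return the top-k knowledge-base entities for the mention."""
    return sorted(kb, key=lambda e: similarity(mention, e), reverse=True)[:k]

def rerank(mention: str, candidates: list[str]) -> str:
    """Stand-in for the distilled open-source LLM re-ranker:
    pick the best candidate among those retrieved."""
    return max(candidates, key=lambda e: similarity(mention, e))

# A misspelled, nonstandard mention is linked to a standard entity.
mention = "migrain"
best = rerank(mention, retrieve_candidates(mention, KB))
print(best)  # → migraine
```

In RPDR, the re-ranking stage is where the distillation pays off: the open-source model is fine-tuned on data generated by the closed-source LLM, so the final ranking step runs locally without API calls.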