Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of annotated data for biomedical entity linking in low-resource settings, as well as the high deployment cost and instability of closed-source large language models (LLMs), this paper proposes RPDR, a knowledge-distillation framework that uses a closed-source LLM (e.g., GPT-4) to generate high-quality training data from unannotated text. Through prompt engineering, retrieval of candidate entities, and fine-tuning on a small amount of data, RPDR transfers the closed-source model's knowledge to lightweight open-source LLMs (e.g., Qwen, Llama) that re-rank retrieved candidates and run entirely on local infrastructure. This “closed-source-assisted open-source” distillation paradigm enables high-performance, low-cost, locally deployable entity linking without extensive manual annotation. Evaluated on the Aier (Chinese) and Ask A Patient (English) datasets, RPDR improves Acc@1 by 0.019 and 0.036 over supervised baselines, respectively, demonstrating strong generalization and practical utility.
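A minimal sketch of the data-generation step the summary describes, assuming the OpenAI chat API as the closed-source LLM; the prompt wording, candidate formatting, and helper names are illustrative assumptions, not the paper's exact recipe:

```python
# Hedged sketch: a closed-source LLM (GPT-4 via the OpenAI API) turns
# unannotated mentions into silver-standard training pairs. Prompt text,
# candidate formatting, and helper names are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_mention(mention: str, candidates: list[str]) -> str:
    """Ask the closed-source LLM which candidate the mention should link to."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    prompt = (
        "You are a biomedical entity linker.\n"
        f"Mention: {mention}\n"
        f"Candidate standard entities:\n{numbered}\n"
        "Reply with only the number of the best candidate."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    digits = "".join(ch for ch in resp.choices[0].message.content if ch.isdigit())
    idx = int(digits) if digits else 0
    return candidates[min(idx, len(candidates) - 1)]

# Each (mention, silver label) pair becomes training data for the
# open-source re-ranker, with no human annotation in the loop.
unlabeled = [("heart attack",
              ["Myocardial infarction", "Cardiac arrest", "Angina pectoris"])]
silver = [(m, label_mention(m, cands)) for m, cands in unlabeled]
```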

📝 Abstract
Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer to new settings, limiting their use in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address this, but they bring stability risks and high economic costs: access to these models is controlled by commercial companies, and processing large amounts of data incurs significant expense. To address this, we propose "RPDR", a framework combining closed-source and open-source LLMs to re-rank candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge into an open-source LLM that can be deployed locally, avoiding both the stability issues and the high economic costs. We evaluate RPDR on two datasets: one real-world dataset and one publicly available dataset, covering two languages, Chinese and English. RPDR achieves Acc@1 improvements of 0.019 on the Aier dataset and 0.036 on the Ask A Patient dataset when training data is insufficient. The results demonstrate the superiority and generalizability of the proposed framework.
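The retrieve-then-re-rank path at inference time might look like the following sketch; the retriever checkpoint, the open-source model ID (a Qwen model, one of the families the summary names), and the prompt format are assumptions rather than the paper's configuration:

```python
# Hedged sketch of local inference: a small dense retriever proposes
# candidates, and a locally hosted open-source LLM re-ranks them.
# Model IDs, the toy knowledge base, and the scoring rule are assumptions.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned retriever
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")

kb = ["Myocardial infarction", "Cardiac arrest", "Angina pectoris", "Heart failure"]
kb_emb = retriever.encode(kb, convert_to_tensor=True)

def link(mention: str, k: int = 3) -> str:
    # Stage 1: dense retrieval of the top-k candidate entities.
    query = retriever.encode(mention, convert_to_tensor=True)
    hits = util.semantic_search(query, kb_emb, top_k=k)[0]
    candidates = [kb[h["corpus_id"]] for h in hits]
    # Stage 2: the local open-source LLM picks the best candidate (re-ranking).
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    prompt = (f"Mention: {mention}\nCandidates:\n{numbered}\n"
              "Reply with the number of the best match:")
    ids = tok(prompt, return_tensors="pt")
    out = llm.generate(**ids, max_new_tokens=4)
    answer = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    digits = "".join(ch for ch in answer if ch.isdigit())
    idx = int(digits) if digits else 0
    return candidates[min(idx, len(candidates) - 1)]

print(link("heart attack"))
```

No remote API is called at this stage, which is what removes the stability and per-query cost concerns the abstract raises.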
Problem

Research questions and friction points this paper is trying to address.

Distill closed-source LLM knowledge for local biomedical entity linking
Reduce economic costs and stability issues in entity linking
Improve performance in low-resource biomedical data scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining closed-source and open-source LLMs
Generating training data via a closed-source LLM
Fine-tuning an open-source LLM for local deployment (sketched below)
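To make the last point concrete, here is a hedged LoRA fine-tuning sketch for distilling the silver pairs into a small open-source model; the base checkpoint, target modules, prompt template, and hyperparameters are illustrative assumptions, not the paper's setup:

```python
# Hedged sketch: fine-tune a small open-source LLM on the (mention, entity)
# pairs produced by the closed-source model, using LoRA so training and
# local deployment stay cheap. All hyperparameters are assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base),
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# Silver training pairs distilled from the closed-source LLM.
silver = [("heart attack", "Myocardial infarction")]

opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
model.train()
for epoch in range(3):
    for mention, entity in silver:
        text = f"Mention: {mention}\nStandard entity: {entity}{tok.eos_token}"
        batch = tok(text, return_tensors="pt")
        # Causal-LM loss over the full sequence; a real run would mask
        # the prompt tokens and batch the examples.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

model.save_pretrained("rpdr-reranker-lora")  # adapter weights for local deployment
```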
👥 Authors

Yihao Ai
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Zhiyuan Ning
Westlake University
Graph Machine Learning, Knowledge Graphs, Large Language Models

Weiwei Dai
Aier Eye Hospital Group

Pengfei Wang
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Yi Du
Chinese Academy of Sciences
Data Mining, Knowledge Engineering, AI for Science

Wenjuan Cui
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

Kunpeng Liu
Assistant Professor, Clemson University
Feature Engineering, LLM Reasoning, Reinforcement Learning

Yuanchun Zhou
Computer Network Information Center, CAS
Data Mining, Big Data Analysis