🤖 AI Summary
To address the scarcity of annotated data for biomedical entity linking in low-resource settings, as well as the high deployment cost and instability of closed-source large language models (LLMs), this paper proposes RPDR, a knowledge distillation framework that uses a closed-source LLM (e.g., GPT-4) to generate high-quality training data from unannotated text. The generated data is used to fine-tune a lightweight open-source LLM (e.g., Qwen, Llama) to re-rank candidates produced by a retriever fine-tuned on a small amount of labeled data, enabling fully local inference. This "closed-source-assisted open-source" distillation paradigm delivers high-performance, low-cost, locally deployable entity linking without extensive manual annotation. Evaluated on the Chinese Aier dataset and the English Ask A Patient dataset, RPDR improves Acc@1 by 0.019 and 0.036, respectively, over supervised baselines in low-data settings, demonstrating strong generalization and practical utility.
📝 Abstract
Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their use in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address this but bring stability risks and high economic costs: access is controlled by commercial providers, and processing large amounts of data is expensive. To address this, we propose "RPDR", a framework that combines closed-source and open-source LLMs to re-rank candidates retrieved by a retriever fine-tuned on a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM on it for re-ranking, we distill the closed-source model's knowledge into an open-source LLM that can be deployed locally, avoiding both the stability issues and the high economic costs. We evaluate RPDR on two datasets, one real-world and one publicly available, covering two languages: Chinese and English. When training data is insufficient, RPDR achieves Acc@1 improvements of 0.019 on the Aier dataset and 0.036 on the Ask A Patient dataset. These results demonstrate the effectiveness and generalizability of the proposed framework.
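At a high level, the pipeline first retrieves candidate standard entities for a nonstandard mention, then re-ranks them with the distilled model. A minimal sketch of that retrieve-then-rerank flow, using toy string-similarity stand-ins for both the fine-tuned retriever and the distilled open-source LLM re-ranker (all function names, the tiny knowledge base, and the similarity heuristic are illustrative assumptions, not the paper's implementation):

```python
# Illustrative retrieve-then-rerank sketch; not the actual RPDR implementation.
from difflib import SequenceMatcher

# Toy knowledge base of standard entities (hypothetical example).
KB = ["myocardial infarction", "migraine", "hypertension"]

def similarity(a: str, b: str) -> float:
    """Surface-form similarity; a stand-in for a learned scoring model."""
    return SequenceMatcher(None, a, b).ratio()

def retrieve_candidates(mention: str, kb: list[str], k: int = 2) -> list[str]:
    """Stand-in for the retriever fine-tuned on a small amount of data:
    return the top-k knowledge-base entities for the mention."""
    return sorted(kb, key=lambda e: similarity(mention, e), reverse=True)[:k]

def rerank(mention: str, candidates: list[str]) -> str:
    """Stand-in for the distilled open-source LLM re-ranker:
    pick the best candidate among those retrieved."""
    return max(candidates, key=lambda e: similarity(mention, e))

# A misspelled, nonstandard mention is linked to a standard entity.
mention = "migrain"
best = rerank(mention, retrieve_candidates(mention, KB))
print(best)  # → migraine
```

In RPDR, the re-ranking stage is where the distillation pays off: the open-source model is fine-tuned on data generated by the closed-source LLM, so the final ranking step runs locally without API calls.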