🤖 AI Summary
This paper targets three challenges in biomedical named entity recognition (BioNER): nested entity identification, ambiguous boundary detection, and poor cross-lingual generalization. It proposes a unified generative framework built on large language models (LLMs) with three components: (1) a symbolic tagging strategy that encodes flat and nested entities, with explicit boundaries, in a single generated output; (2) a contrastive learning-based entity selector that filters incorrect or spurious predictions using boundary-sensitive positive and negative samples; and (3) bilingual joint fine-tuning on Chinese and English datasets to improve multilingual and multi-task transfer. Evaluated on four benchmark datasets and two unseen corpora, the approach is reported to achieve state-of-the-art performance and robust zero-shot generalization across languages.
📝 Abstract
Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source code is freely available at https://github.com/dreamer-tx/LLMNER.
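To make the idea of symbolic tagging in generated text more concrete, here is a minimal sketch of how a tagged output string could be decoded into flat and nested entity spans. The abstract does not specify the paper's actual symbols, so the `<TYPE>…</TYPE>` markers, the `parse_tagged` function, and the example sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical inline markers (not the paper's actual symbols):
# <TYPE> opens an entity and </TYPE> closes it.
TAG = re.compile(r"<(/?)([A-Za-z_]+)>")


def parse_tagged(tagged):
    """Decode a tagged string into (plain_text, entities).

    entities is a list of (start, end, type, surface) tuples whose
    character offsets index into plain_text. Nested entities are kept.
    """
    pieces = []    # untagged text chunks, in order
    length = 0     # length of the plain text emitted so far
    stack = []     # currently open entities: (type, start_offset)
    spans = []     # closed entities: (start, end, type)
    pos = 0
    for m in TAG.finditer(tagged):
        chunk = tagged[pos:m.start()]
        pieces.append(chunk)
        length += len(chunk)
        pos = m.end()
        is_closing, etype = m.group(1) == "/", m.group(2)
        if not is_closing:
            stack.append((etype, length))
        elif stack and stack[-1][0] == etype:
            _, start = stack.pop()
            spans.append((start, length, etype))
        # A closing tag that does not match the innermost open tag is
        # ignored, treating it as a malformed generation.
    pieces.append(tagged[pos:])
    text = "".join(pieces)
    return text, [(s, e, t, text[s:e]) for s, e, t in sorted(spans)]


if __name__ == "__main__":
    text, entities = parse_tagged(
        "The <DNA><protein>IL-2</protein> gene</DNA> is expressed."
    )
    print(text)
    for entity in entities:
        print(entity)
```

Because open tags are tracked on a stack, an inner entity such as a protein mention inside a DNA mention is recovered alongside the outer one, each with explicit start and end offsets. Tags left unclosed at the end of the string are dropped in this sketch; a real decoder might instead pass such candidates to a downstream filter such as the entity selector the abstract describes.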