Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition

📅 2026-01-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited generalization of models to unseen concepts in biomedical concept recognition, a challenge exacerbated by the scarcity of manually annotated data. To this end, the authors propose a systematic evaluation framework for Mention-agnostic Biomedical Concept Recognition (MA-BCR), built on hierarchical concept indices and novel metrics for measuring generalization. Building on this framework, they introduce an automated labeling pipeline that uses large language models (LLMs) to generate Auto-Labeled Data (ALD) for training. The results indicate that LLM-generated annotations cannot fully replace human-labeled data, but they improve generalization to unseen concepts by giving models broader concept coverage and structural knowledge.

📝 Abstract

In Mention-agnostic Biomedical Concept Recognition (MA-BCR), generalization to unseen concepts is a central challenge because human annotations are scarce. This work makes two contributions to address this issue systematically. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our results show that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, providing models with the broader coverage and structural knowledge needed to move toward recognizing unseen concepts. Code and datasets are available at https://github.com/bio-ie-tool/hi-ald.
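
To make the idea of measuring generalization to unseen concepts concrete, here is a minimal sketch. It is not the paper's metric: it assumes that mention-agnostic evaluation compares document-level sets of concept IDs (no mention spans), and that a concept counts as "unseen" when it never occurs in the training annotations. The function name `seen_unseen_recall`, the data layout, and the concept IDs are all hypothetical, and the hierarchical concept indices described in the abstract are not modeled.

```python
from typing import Dict, Set


def seen_unseen_recall(
    gold: Dict[str, Set[str]],
    pred: Dict[str, Set[str]],
    train_concepts: Set[str],
) -> Dict[str, float]:
    """Document-level concept recall, split by whether each gold
    concept appeared in the training annotations (an assumed protocol)."""
    hits = {"seen": 0, "unseen": 0}
    totals = {"seen": 0, "unseen": 0}
    for doc_id, gold_ids in gold.items():
        predicted = pred.get(doc_id, set())
        for concept_id in gold_ids:
            bucket = "seen" if concept_id in train_concepts else "unseen"
            totals[bucket] += 1
            if concept_id in predicted:
                hits[bucket] += 1
    # Report 0.0 for an empty bucket rather than dividing by zero.
    return {b: (hits[b] / totals[b] if totals[b] else 0.0) for b in totals}


# Hypothetical toy data: concept IDs are placeholders, not real ontology terms.
gold = {"doc1": {"C1", "C2"}, "doc2": {"C3"}}
pred = {"doc1": {"C1"}, "doc2": {"C3"}}
train_concepts = {"C1"}

scores = seen_unseen_recall(gold, pred, train_concepts)
print(scores)
```

Reporting the two buckets separately keeps a model that only memorizes training concepts from hiding behind a high overall score. The gap between the "seen" and "unseen" values is one simple way to see how well a model generalizes.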

Problem

Research questions and friction points this paper is trying to address.

unseen concepts

biomedical concept recognition

generalization

data scarcity

mention-agnostic

Innovation

Methods, ideas, or system contributions that make the work stand out.

unseen concept generalization

biomedical concept recognition

LLM-based auto-labeling