🤖 AI Summary
To address classification errors, claim processing delays, and inefficient manual mapping caused by inconsistencies between institutional procedure names and insurance-standard terminology in the Romanian healthcare system, this paper proposes the first retrieval-based terminology matching framework for Romanian-language clinical text. Methodologically, we systematically evaluate and ensemble Romanian-specific (RoBERTa-base-ro), multilingual (mBERT), and domain-adapted biomedical language models (BioBERT-ro) to construct a sentence-embedding-based semantic similarity pipeline. Evaluated on real-world Romanian medical data, our approach achieves 92.4% Top-1 matching accuracy, substantially outperforming edit-distance and generic word-embedding baselines. This work bridges a critical gap in low-resource language medical terminology alignment and empirically demonstrates the essential role of domain-adapted embeddings for Romanian clinical NLP.
📝 Abstract
Accurately mapping medical procedure names from healthcare providers to the standardized terminology used by insurance companies is a crucial yet complex task. Inconsistencies in naming conventions lead to misclassified procedures, causing administrative inefficiencies and insurance claim problems in private healthcare settings. Many companies still rely on staff to perform this mapping manually, even though the task is a clear candidate for automation. This paper proposes a retrieval-based architecture leveraging sentence embeddings for medical name matching in the Romanian healthcare system. The task is significantly more difficult in underrepresented languages such as Romanian, where existing pretrained language models lack domain-specific adaptation to medical text. We evaluate multiple embedding models, including Romanian, multilingual, and medical-domain-specific representations, to identify the most effective solution for this task. Our findings contribute to the broader field of medical NLP for low-resource languages.
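The retrieval step described in the abstract can be illustrated with a minimal sketch: embed the provider's procedure name, embed every standardized term, and return the term with the highest cosine similarity. Everything concrete here is an assumption made for illustration. The `embed` function is a toy character-trigram stand-in for a real sentence-embedding model, not one of the models evaluated in the paper, and the term lists are hypothetical examples rather than data from the study.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy stand-in for a sentence-embedding model (assumption):
    a sparse vector of character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, standard_terms, top_k=1):
    """Rank standardized terms by similarity to the provider's name."""
    q = embed(query)
    scored = [(cosine(q, embed(term)), term) for term in standard_terms]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def top1_accuracy(pairs, standard_terms):
    """Fraction of queries whose best match equals the gold term."""
    hits = sum(retrieve(q, standard_terms)[0][1] == gold for q, gold in pairs)
    return hits / len(pairs)

# Hypothetical insurer-side terminology and provider-side names.
STANDARD_TERMS = [
    "Ecografie abdominala",
    "Radiografie toracica",
    "Consultatie cardiologie",
]
LABELED_PAIRS = [
    ("ecografie abdomen", "Ecografie abdominala"),
    ("radiografie torace", "Radiografie toracica"),
]

best_score, best_term = retrieve("ecografie abdomen", STANDARD_TERMS)[0]
accuracy = top1_accuracy(LABELED_PAIRS, STANDARD_TERMS)
```

In a realistic setting, `embed` would be replaced by a call to a pretrained embedding model, and the standardized terms would be embedded once and cached rather than recomputed per query. The ranking and Top-1 evaluation logic stays the same regardless of which embedding model is plugged in, which is what makes a side-by-side comparison of models straightforward.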