🤖 AI Summary
This study addresses a critical gap in evaluating large language models (LLMs) in non-English clinical settings. We introduce the first cross-lingual medical benchmark grounded in Poland’s national medical licensing and specialty examinations (LEK/LDEK/PES), comprising over 24,000 authentic exam questions together with an officially translated parallel English subset. Methodologically, we employ web crawling for data acquisition, triple-layer human verification, and domain-expert translation validation, coupled with a hybrid evaluation framework that integrates zero-shot/few-shot prompting, model introspection analysis, and human performance baselines. Key contributions include: (1) establishing the first authoritative multilingual medical evaluation benchmark built on Eastern European clinical licensing exams; (2) pioneering the adaptation of high-stakes professional licensure exams as rigorous metrics of LLM capability; and (3) revealing substantial cross-lingual (Polish→English) and cross-specialty (e.g., emergency medicine, pharmacology) performance gaps that highlight real-world clinical deployment risks. Experiments show that GPT-4o approaches medical-student proficiency, yet most models remain severely limited in clinical terminology alignment and deep clinical reasoning.
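As a concrete illustration of the zero-shot arm of such an evaluation framework, the sketch below scores multiple-choice exam items overall and per specialty. The record fields (`stem`, `options`, `answer`, `specialty`) and the `query_model` callable are hypothetical placeholders for illustration, not the paper's actual interface.

```python
# A minimal sketch of a zero-shot multiple-choice evaluation loop; the
# question record format and the query_model() client are assumptions.
import re
from collections import defaultdict

def build_prompt(q):
    # Zero-shot: present the stem and lettered options, ask for one letter.
    options = "\n".join(f"{letter}. {text}" for letter, text in q["options"].items())
    return (
        "Answer the following medical exam question with a single letter.\n\n"
        f"{q['stem']}\n{options}\nAnswer:"
    )

def extract_choice(response):
    # Take the first standalone A-E letter in the model's reply.
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None

def evaluate(questions, query_model):
    # Accuracy overall and per specialty, mirroring the per-domain analysis.
    correct, per_specialty = 0, defaultdict(lambda: [0, 0])
    for q in questions:
        choice = extract_choice(query_model(build_prompt(q)))
        hit = choice == q["answer"]
        correct += hit
        per_specialty[q["specialty"]][0] += hit
        per_specialty[q["specialty"]][1] += 1
    return correct / len(questions), {
        s: c / n for s, (c, n) in per_specialty.items()
    }
```

For the few-shot arm, `build_prompt` would simply be prefixed with a handful of worked exemplar questions before the target item.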
📝 Abstract
Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies focus predominantly on English-language contexts. This study introduces a novel benchmark dataset based on the Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and by practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a parallel Polish-English subset whose English portion was professionally translated by the examination center for foreign candidates. By structuring these existing exam questions into a benchmark, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against that of human medical students. Our analysis reveals that while models such as GPT-4o achieve near-human performance, significant challenges persist in cross-lingual performance and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.
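To make the structure of the parallel subset concrete, here is a minimal sketch of what one record and a Polish→English gap computation might look like. The field names and the `language_gap` helper are illustrative assumptions; the released dataset schema is not reproduced here.

```python
# A hypothetical record for one item in the parallel Polish-English subset,
# plus a helper computing the per-model Polish->English accuracy gap.
# Field names are illustrative; the actual dataset schema may differ.
from dataclasses import dataclass

@dataclass
class ParallelQuestion:
    exam: str            # "LEK", "LDEK", or "PES"
    specialty: str       # e.g. "emergency medicine"
    stem_pl: str         # original Polish stem
    stem_en: str         # official English translation
    options_pl: dict     # letter -> Polish option text
    options_en: dict     # letter -> English option text
    answer: str          # correct letter key

def language_gap(results_pl, results_en):
    # results_*: dicts mapping model name -> accuracy on the same item set.
    # A positive gap means the model scores higher in Polish than in English.
    return {m: results_pl[m] - results_en[m] for m in results_pl}
```

Given per-model accuracies on both halves of the parallel subset, the sign of each gap indicates which language a model handles better on identical clinical content.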