🤖 AI Summary
This study addresses a critical gap in evaluating large language models (LLMs) in non-English clinical settings. We introduce the first cross-lingual medical benchmark grounded in Poland’s national medical licensing and specialty examinations (LEK/LDEK/PES), comprising over 24,000 authentic exam questions together with an officially translated parallel English subset. Methodologically, we employ web crawling for data acquisition, triple-layer human verification, and domain-expert translation validation, coupled with a hybrid evaluation framework that integrates zero-shot/few-shot prompting, model introspection analysis, and human performance baselines. Key contributions include: (1) establishing the first authoritative multilingual medical evaluation benchmark built on Eastern European clinical licensing exams; (2) pioneering the adaptation of high-stakes professional licensure exams as rigorous metrics of LLM capability; and (3) revealing substantial cross-lingual (Polish→English) and cross-specialty (e.g., emergency medicine, pharmacology) performance gaps that highlight real-world clinical deployment risks. Experiments show that GPT-4o approaches medical-student proficiency, yet most models remain severely limited in clinical terminology alignment and deep clinical reasoning.
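As a concrete illustration of the zero-shot arm of such an evaluation framework, the sketch below scores multiple-choice exam items overall and per specialty. The record fields (`stem`, `options`, `answer`, `specialty`) and the `query_model` callable are hypothetical placeholders for illustration, not the paper's actual interface.

```python
# A minimal sketch of a zero-shot multiple-choice evaluation loop; the
# question record format and the query_model() client are assumptions.
import re
from collections import defaultdict

def build_prompt(q):
    # Zero-shot: present the stem and lettered options, ask for one letter.
    options = "\n".join(f"{letter}. {text}" for letter, text in q["options"].items())
    return (
        "Answer the following medical exam question with a single letter.\n\n"
        f"{q['stem']}\n{options}\nAnswer:"
    )

def extract_choice(response):
    # Take the first standalone A-E letter in the model's reply.
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None

def evaluate(questions, query_model):
    # Accuracy overall and per specialty, mirroring the per-domain analysis.
    correct, per_specialty = 0, defaultdict(lambda: [0, 0])
    for q in questions:
        choice = extract_choice(query_model(build_prompt(q)))
        hit = choice == q["answer"]
        correct += hit
        per_specialty[q["specialty"]][0] += hit
        per_specialty[q["specialty"]][1] += 1
    return correct / len(questions), {
        s: c / n for s, (c, n) in per_specialty.items()
    }
```

For the few-shot arm, `build_prompt` would simply be prefixed with a handful of worked exemplar questions before the target item.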
📝 Abstract
Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies focus predominantly on English-language contexts. This study introduces a novel benchmark dataset based on the Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and by practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a parallel Polish-English subset whose English portion was professionally translated by the examination center for foreign candidates. By structuring these existing exam questions into a benchmark, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against that of human medical students. Our analysis reveals that while models such as GPT-4o achieve near-human performance, significant challenges persist in cross-lingual performance and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.
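To make the structure of the parallel subset concrete, here is a minimal sketch of what one record and a Polish→English gap computation might look like. The field names and the `language_gap` helper are illustrative assumptions; the released dataset schema is not reproduced here.

```python
# A hypothetical record for one item in the parallel Polish-English subset,
# plus a helper computing the per-model Polish->English accuracy gap.
# Field names are illustrative; the actual dataset schema may differ.
from dataclasses import dataclass

@dataclass
class ParallelQuestion:
    exam: str            # "LEK", "LDEK", or "PES"
    specialty: str       # e.g. "emergency medicine"
    stem_pl: str         # original Polish stem
    stem_en: str         # official English translation
    options_pl: dict     # letter -> Polish option text
    options_en: dict     # letter -> English option text
    answer: str          # correct letter key

def language_gap(results_pl, results_en):
    # results_*: dicts mapping model name -> accuracy on the same item set.
    # A positive gap means the model scores higher in Polish than in English.
    return {m: results_pl[m] - results_en[m] for m in results_pl}
```

Given per-model accuracies on both halves of the parallel subset, the sign of each gap indicates which language a model handles better on identical clinical content.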