CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack reliable trustworthiness evaluation in mid- and low-resource languages within clinical settings, hindering their global deployment. To address this gap, we introduce CLINIC, the first multilingual clinical trustworthiness benchmark, covering 15 languages across all major continents, together with a five-dimensional evaluation framework assessing truthfulness, fairness, safety, robustness, and privacy, aligned with real-world clinical requirements. Leveraging multilingual prompt engineering, adversarial attacks, bias quantification, privacy-leakage detection, and cross-lingual consistency analysis, CLINIC systematically evaluates leading LLMs on disease identification, diagnosis, and treatment tasks. Experimental results reveal significant cross-lingual degradation in factual correctness, demographic fairness, privacy preservation, and adversarial robustness, particularly for under-resourced languages. CLINIC establishes a reproducible, standardized evaluation protocol and actionable improvement pathways for the safe, equitable, and globally scalable deployment of medical AI.

📝 Abstract
Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
Problem

Research questions and friction points this paper is trying to address.

Evaluates trustworthiness of language models in multilingual healthcare settings
Addresses lack of reliable evaluation for LMs in mid- and low-resource languages
Benchmarks LMs across truthfulness, fairness, safety, robustness, and privacy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive multilingual benchmark for healthcare trustworthiness evaluation
Systematically assesses five trustworthiness dimensions across 18 tasks
Covers 15 languages and diverse healthcare topics globally
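The aggregation step behind such a benchmark (per-task scores collected across languages and trust dimensions, then compared against a reference language to measure cross-lingual degradation) can be sketched as follows. This is a hypothetical illustration, not code released with the paper; the function names, scoring scheme, and language codes are assumptions.

```python
# Hypothetical sketch of a CLINIC-style score aggregation and
# cross-lingual degradation analysis. Illustrative only.
from statistics import mean

DIMENSIONS = ["truthfulness", "fairness", "safety", "robustness", "privacy"]

def dimension_scores(results):
    """Average per-task scores into one score per trust dimension.

    `results` maps (language, dimension) -> list of per-task scores in [0, 1].
    Returns {language: {dimension: mean score}}.
    """
    by_lang = {}
    for (lang, dim), scores in results.items():
        by_lang.setdefault(lang, {})[dim] = mean(scores)
    return by_lang

def cross_lingual_degradation(by_lang, reference="en"):
    """Score drop of each language relative to the reference language,
    averaged over the five dimensions (positive = worse than reference)."""
    ref = by_lang[reference]
    return {
        lang: mean(ref[d] - dims[d] for d in DIMENSIONS)
        for lang, dims in by_lang.items()
        if lang != reference
    }

# Toy example: a model that scores well in English but degrades in a
# lower-resource language (two tasks per dimension, made-up scores).
results = {("en", d): [0.9, 0.8] for d in DIMENSIONS}
results.update({("sw", d): [0.6, 0.5] for d in DIMENSIONS})

deg = cross_lingual_degradation(dimension_scores(results))
print({lang: round(v, 3) for lang, v in deg.items()})  # {'sw': 0.3}
```

The reference-language comparison mirrors the paper's finding that performance drops are concentrated in under-resourced languages; a real harness would populate `results` from the 18 tasks rather than hand-written scores.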
👥 Authors

Akash Ghosh (Indian Institute of Technology Patna)
Srivarshinee Sridhar (Indian Institute of Technology Patna)
Raghav Kaushik Ravi (Indian Institute of Technology Patna)
Muhsin Muhsin (IGIMS, Patna)
Sriparna Saha (Indian Institute of Technology Patna)
Chirag Agarwal (Assistant Professor, UVA)