Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: A Pilot Study

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks systematic comparison of large language models (LLMs) versus conventional machine translation (MT) systems for medical consultation summarization across morphologically diverse languages—specifically English-to-Arabic, Chinese, and Vietnamese—with explicit distinction between patient-friendly and clinician-oriented texts. Method: We evaluated state-of-the-art open-source and commercial LLMs (e.g., Llama, Qwen) alongside statistical and neural MT systems (e.g., Google Translate, DeepL, OpenNMT) using standard automatic metrics (BLEU, METEOR) on domain-specific medical summaries. Contribution/Results: Conventional MT outperformed LLMs overall—especially on morphologically complex, terminology-dense clinical text—though LLMs approached MT quality on simplified Vietnamese and Chinese summaries and unexpectedly surpassed baselines in Arabic. Critically, all automated methods failed to ensure clinical accuracy. The study reveals nonlinear interactions between linguistic morphology, text type, and translation performance, and exposes fundamental limitations of generic evaluation metrics in capturing clinical relevance—underscoring the necessity of human expert validation and domain-specific fine-tuning.

📝 Abstract
This study evaluates how well large language models (LLMs) and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese. It assesses both patient-friendly and clinician-focused texts using standard automated metrics. Results showed that traditional MT tools generally performed better, especially for complex texts, while LLMs showed promise, particularly in Vietnamese and Chinese, when translating simpler summaries. Arabic translations improved with complexity due to the language's morphology. Overall, while LLMs offer contextual flexibility, they remain inconsistent, and current evaluation metrics fail to capture clinical relevance. The study highlights the need for domain-specific training, improved evaluation methods, and human oversight in medical translation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs vs traditional MT for medical translations
Assessing performance across English, Arabic, Chinese, Vietnamese
Identifying gaps in clinical relevance and evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic comparison of LLMs and traditional MT on medical consultation summaries across three morphologically diverse target languages
Separate evaluation of patient-friendly and clinician-focused texts using BLEU and METEOR
Shows generic automatic metrics miss clinical accuracy, motivating domain-specific fine-tuning and human expert validation