Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

career value

187K/year

🤖 AI Summary

This study identifies vocabulary mismatch—driven by high out-of-vocabulary (OOV) rates and terminology novelty—as the core bottleneck degrading large language model (LLM) performance in medical text summarization. We empirically reveal, for the first time, that even ultra-large-vocabulary models such as Llama-3.1 suffer from excessive subword segmentation of medical terms. To address this, we propose a multi-strategy vocabulary adaptation framework comprising vocabulary expansion, subword merging, and lightweight continual pretraining. Our method is rigorously evaluated across three diverse medical benchmarks—MIMIC-III, PubMed, and MedSecSum—demonstrating a +4.2 absolute improvement in ROUGE-L score. Furthermore, 92% of clinical experts rated the adapted summaries as more accurate and clinically relevant in human evaluation. We publicly release the full benchmark suite and codebase to establish a reproducible foundation for advancing vocabulary robustness in medical LLMs.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation of multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' performance drop in high OOV medical texts

Addressing vocabulary mismatch via medical domain adaptation

Improving summarization relevance with expert vocabulary updates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vocabulary adaptation for medical OOV words

Continual pretraining strategies for LLMs

Human evaluation with medical experts

🔎 Similar Papers

No similar papers found.