Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tokenization inefficiency that arises when large language models are applied to specialized domains such as law and medicine, where generic vocabularies fit domain-specific terminology poorly. The authors propose a parameter-efficient vocabulary adaptation method that injects domain-specific tokens while replacing under-trained or unreachable tokens. Compared with expansion-only methods, this reduces parameter counts by up to 37%; compared with continual pretraining, it cuts training time by 35–55%. Experiments on Llama-3.1-8B and Qwen2.5-7B, run under a challenge-oriented evaluation protocol, show higher semantic similarity to reference summaries, along with better coherence and more appropriate use of domain-specific terminology.

📝 Abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks under a challenge-oriented evaluation protocol focused on expert-driven texts and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduces training time by $35$–$55\%$ relative to continual pretraining and reduces parameter counts by up to $37\%$ relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.
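
To make the replace-then-expand idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the set of replaceable token IDs (under-trained or unreachable tokens) has already been identified by some separate procedure, and it works on a toy vocabulary dictionary rather than a real tokenizer. The function name `replace_then_expand`, the toy vocabulary, and the example tokens are all hypothetical. New domain tokens first reuse replaceable slots; only the remainder grows the vocabulary, and therefore the embedding matrix.

```python
from typing import Dict, List, Tuple


def replace_then_expand(
    vocab: Dict[str, int],
    new_tokens: List[str],
    replaceable_ids: List[int],
) -> Tuple[Dict[str, int], int]:
    """Hypothetical helper: return (adapted vocab, number of added rows).

    Assumes `vocab` is non-empty and `replaceable_ids` was computed elsewhere.
    """
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    adapted = dict(vocab)

    # Keep only tokens that are genuinely new, preserving order, no duplicates.
    pending = [t for t in dict.fromkeys(new_tokens) if t not in vocab]
    # Only slots that actually exist in the vocabulary can be reused.
    free_ids = sorted(set(replaceable_ids) & set(id_to_token))

    next_id = max(vocab.values()) + 1
    added_rows = 0
    for tok in pending:
        if free_ids:
            # Replace: overwrite a slot, so no new embedding row is needed.
            slot = free_ids.pop(0)
            del adapted[id_to_token[slot]]
            adapted[tok] = slot
        else:
            # Expand: append a new ID, which costs one extra embedding row.
            adapted[tok] = next_id
            next_id += 1
            added_rows += 1
    return adapted, added_rows


# Toy example (all values are illustrative assumptions).
toy_vocab = {"the": 0, "a": 1, "<unused_0>": 2, "Ġglitch": 3, "law": 4}
domain_tokens = ["habeas", "corpus", "tort", "law"]
adapted, added_rows = replace_then_expand(toy_vocab, domain_tokens, [2, 3])
print(adapted, added_rows)
```

In this toy case three tokens are genuinely new, two of them reuse replaceable slots, and only one new row is appended, whereas an expansion-only scheme would append three. How the embeddings of reused or appended slots are initialized is outside the scope of this sketch.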

Problem

Research questions and friction points this paper is trying to address.

tokenization inefficiency

vocabulary mismatch

domain adaptation

text summarization

out-of-vocabulary

Innovation

Methods, ideas, or system contributions that make the work stand out.

vocabulary adaptation

parameter-efficient adaptation

domain-specific tokenization