🤖 AI Summary
Fixed vocabularies prevent large language models (LLMs) from handling novel, out-of-vocabulary tokens. To address this, the authors present DVAGen, an open-source framework that enables dynamic vocabulary expansion during both training and inference. The framework provides a modular pipeline compatible with mainstream open-weight LLMs; supports end-to-end training, evaluation, and visualization; and is the first to pair batched inference with both a command-line interface (CLI) and an interactive web-based UI for real-time result inspection. Crucially, dynamic vocabulary adaptation is integrated directly into the LLM's generation process, requiring no architectural modifications or retraining. Experiments show improved handling of novel tokens and higher inference throughput across multiple benchmark tasks.
📝 Abstract
Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
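The core mechanism described above can be illustrated in miniature: a dynamic vocabulary augments the model's output head with phrase embeddings computed on the fly, so novel multi-token phrases compete with base tokens at decoding time while the base weights stay frozen. The sketch below is illustrative only and is not DVAGen's actual API; the toy phrase encoder, array shapes, and names (`expand_vocab`, `encode`) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: hidden size 8, base vocabulary of 10 tokens.
# Sizes and names are illustrative, not taken from DVAGen.
HIDDEN = 8
base_vocab = [f"tok{i}" for i in range(10)]
lm_head = rng.normal(size=(len(base_vocab), HIDDEN))  # frozen base output embeddings


def expand_vocab(lm_head, vocab, phrases, encode):
    """Append phrase embeddings to the output head at inference time,
    leaving the base model weights untouched."""
    new_rows = np.stack([encode(p) for p in phrases])
    return np.vstack([lm_head, new_rows]), vocab + phrases


# Stand-in phrase encoder (a real system would use the LM itself):
# mean character code scaled into a fixed hidden-space direction.
proj = rng.normal(size=HIDDEN)


def encode(phrase):
    return np.mean([ord(c) for c in phrase]) / 100.0 * proj


# Dynamically register two novel phrases, then score as usual.
head, vocab = expand_vocab(
    lm_head, base_vocab, ["dynamic vocabulary", "DVAGen"], encode
)
hidden_state = rng.normal(size=HIDDEN)      # last hidden state from the LM
logits = head @ hidden_state                # scores over base + dynamic entries
next_token = vocab[int(np.argmax(logits))]  # a novel phrase can win the argmax
print(len(vocab), head.shape)
```

Because expansion only appends rows to the output projection, the same frozen model can serve different dynamic vocabularies per request, which is what makes batched inference over heterogeneous vocabularies feasible.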