DVAGen: Dynamic Vocabulary Augmented Generation

📅 2025-10-19
🤖 AI Summary
Fixed vocabularies limit how well large language models (LLMs) generalize to novel and out-of-vocabulary words. DVAGen is a fully open-source, unified framework for training, evaluating, and visualizing dynamic vocabulary-augmented language models. Its modular pipeline integrates with open-source LLMs and provides batch inference, a command-line interface (CLI), and a web-based UI for real-time result inspection. Experiments validate dynamic vocabulary methods on modern LLMs and show that batch inference significantly improves inference throughput.

📝 Abstract
Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
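
As a rough illustration of the general dynamic-vocabulary idea (not DVAGen's actual API), the sketch below scores a fixed token vocabulary and a set of phrases supplied at inference time under one shared softmax, so a multi-word phrase can be selected in a single decoding step. All names (`next_token_distribution`, `static_vocab`, `phrase_table`) are hypothetical, and the two-dimensional embeddings are toy values chosen by hand; a real system would obtain phrase embeddings from a learned phrase encoder.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(hidden, static_vocab, phrase_table):
    """Score static tokens and dynamic phrases with one shared softmax.

    hidden:       the model's hidden state at the current step (toy vector)
    static_vocab: fixed token -> output embedding (hypothetical values)
    phrase_table: phrase -> embedding, supplied per request (hypothetical)
    """
    entries = list(static_vocab.items()) + list(phrase_table.items())
    probs = softmax([dot(hidden, emb) for _, emb in entries])
    return {name: p for (name, _), p in zip(entries, probs)}


# Toy data: two static tokens plus one phrase added at inference time.
static_vocab = {"dynamic": [1.0, 0.0], "vocab": [0.0, 1.0]}
phrase_table = {"dynamic vocabulary": [2.0, 2.0]}
hidden = [1.0, 1.0]

dist = next_token_distribution(hidden, static_vocab, phrase_table)
best = max(dist, key=dist.get)
print(best)
```

Because the phrase entries are passed in per call, the candidate set can change from one request to the next without touching the static vocabulary, which is the property the abstract refers to as a dynamic vocabulary.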
Problem

Research questions and friction points this paper is trying to address.

Addresses language models' inability to handle novel vocabulary words
Addresses the fragmented codebases and limited inference scalability of existing dynamic vocabulary approaches
Provides a unified framework for training, evaluating, and visualizing dynamic vocabulary-augmented models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source framework for dynamic vocabulary training
Modular pipeline with CLI and WebUI tools
Supports batch inference for improved throughput
Authors

Wei Du
School of Computer Science and Technology, East China Normal University
Nuowei Liu
School of Computer Science and Technology, East China Normal University
Jie Wang
School of Computer Science and Technology, East China Normal University
Jiahao Kuang
School of Computer Science and Technology, East China Normal University
Tao Ji
Renmin University of China
Xiaoling Wang
School of Computer Science and Technology, East China Normal University
Yuanbin Wu
School of Computer Science and Technology, East China Normal University