🤖 AI Summary
Fixed vocabularies prevent large language models (LLMs) from handling novel, out-of-vocabulary tokens. To address this, the authors present DVAGen, an open-source framework that enables dynamic vocabulary expansion during both training and inference. The framework provides a modular pipeline compatible with mainstream open-weight LLMs; supports end-to-end training, evaluation, and visualization; and is the first to pair batched inference with both a command-line interface (CLI) and an interactive web-based UI for real-time result inspection. Crucially, dynamic vocabulary adaptation is integrated directly into the LLM's generation process, requiring no architectural modifications or retraining. Experiments show improved handling of novel tokens and higher inference throughput across multiple benchmark tasks.
📝 Abstract
Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
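The core mechanism described above can be illustrated in miniature: a dynamic vocabulary augments the model's output head with phrase embeddings computed on the fly, so novel multi-token phrases compete with base tokens at decoding time while the base weights stay frozen. The sketch below is illustrative only and is not DVAGen's actual API; the toy phrase encoder, array shapes, and names (`expand_vocab`, `encode`) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: hidden size 8, base vocabulary of 10 tokens.
# Sizes and names are illustrative, not taken from DVAGen.
HIDDEN = 8
base_vocab = [f"tok{i}" for i in range(10)]
lm_head = rng.normal(size=(len(base_vocab), HIDDEN))  # frozen base output embeddings


def expand_vocab(lm_head, vocab, phrases, encode):
    """Append phrase embeddings to the output head at inference time,
    leaving the base model weights untouched."""
    new_rows = np.stack([encode(p) for p in phrases])
    return np.vstack([lm_head, new_rows]), vocab + phrases


# Stand-in phrase encoder (a real system would use the LM itself):
# mean character code scaled into a fixed hidden-space direction.
proj = rng.normal(size=HIDDEN)


def encode(phrase):
    return np.mean([ord(c) for c in phrase]) / 100.0 * proj


# Dynamically register two novel phrases, then score as usual.
head, vocab = expand_vocab(
    lm_head, base_vocab, ["dynamic vocabulary", "DVAGen"], encode
)
hidden_state = rng.normal(size=HIDDEN)      # last hidden state from the LM
logits = head @ hidden_state                # scores over base + dynamic entries
next_token = vocab[int(np.argmax(logits))]  # a novel phrase can win the argmax
print(len(vocab), head.shape)
```

Because expansion only appends rows to the output projection, the same frozen model can serve different dynamic vocabularies per request, which is what makes batched inference over heterogeneous vocabularies feasible.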