zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing static tokenizers generalize poorly across domains and languages, producing excessively long token sequences and high inference overhead. To address this, we propose a dynamic, inference-time token compression mechanism: an incremental, online tokenizer inspired by the LZW algorithm that progressively merges input subwords into reusable hypertokens, with a lightweight runtime embedding layer generating their contextual representations. The approach requires no architectural modifications or full-parameter fine-tuning: only ~10 GPU-hours of causal language modeling adaptation with parameter-efficient fine-tuning. Experiments demonstrate a 20–60% reduction in input/output sequence length, yielding substantial inference latency improvements while maintaining full compatibility with mainstream large language models. To our knowledge, this is the first hypertokenization framework to enable vocabulary-adaptive expansion at inference time without retraining.
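The LZW-style merging described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, the id-assignment scheme, and the choice to start hypertoken ids at `base_vocab_size` are all assumptions.

```python
def lzw_hypertokenize(token_ids, base_vocab_size):
    """Compress a token-id sequence LZW-style: phrases of base tokens
    seen once are registered as new 'hypertoken' ids, so repeats of
    the same phrase are emitted as a single token."""
    # Single base tokens map to themselves, like LZW's initial dictionary.
    table = {(t,): t for t in set(token_ids)}
    next_id = base_vocab_size  # hypertoken ids start above the base vocab
    out = []
    phrase = ()
    for tok in token_ids:
        candidate = phrase + (tok,)
        if candidate in table:
            phrase = candidate          # extend the current phrase
        else:
            out.append(table[phrase])   # emit longest known phrase
            table[candidate] = next_id  # register a new hypertoken
            next_id += 1
            phrase = (tok,)
    if phrase:
        out.append(table[phrase])
    return out, table
```

On a repetitive input such as `[1, 2, 1, 2, 1, 2]`, the second and third `(1, 2)` pairs are each emitted as one hypertoken, so the output sequence is shorter than the input.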

📝 Abstract
Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
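Component (2), the runtime embedding layer, must produce a vector for each hypertoken the moment it is formed. As a minimal sketch, one could pool the embeddings of a hypertoken's constituent base tokens; the paper's actual embedding layer is a learned lightweight module, so the mean-pooling below is an illustrative assumption only, as is the function name.

```python
import numpy as np

def hypertoken_embeddings(table, base_embeddings):
    """For each multi-token phrase registered in the LZW table, derive
    an embedding by mean-pooling its constituent base-token embeddings.
    Placeholder scheme: zip2zip uses a learned embedding module instead."""
    hyper = {}
    for phrase, hid in table.items():
        if len(phrase) > 1:  # only multi-token phrases get new hypertoken ids
            hyper[hid] = base_embeddings[list(phrase)].mean(axis=0)
    return hyper
```

Because the pooled vector is computed on the fly from rows of the existing embedding matrix, no retraining of the base embedding table is needed when a new hypertoken appears.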
Problem

Research questions and friction points this paper is trying to address.

Dynamic token vocabulary adaptation for LLMs
Reducing token sequence length for efficiency
Improving inference latency via token compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token vocabulary adjustment at inference time
LZW compression for reusable hypertokens creation
Parameter-efficient finetuning for existing LLMs adaptation