Back to Bytes: Revisiting Tokenization Through UTF-8

📅 2025-10-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing byte-level tokenizers often introduce out-of-bound token IDs (e.g., >255) or auxiliary special tokens, harming inference efficiency and model compatibility. This paper proposes UTF8Tokenizer: a minimalist, purely byte-level tokenizer that maps text strictly to UTF-8 byte IDs in [0, 255], eliminating out-of-bound IDs and extra tokens entirely. It repurposes C0 control characters (0x00–0x1F) to uniformly encode special semantics (e.g., BOS, EOS, padding), and introduces zero-overhead bit-biased embeddings that explicitly model byte-level bit structure. The tokenizer is fully compatible with the Hugging Face ecosystem and supports standard shared 256×d embedding tables. Experiments demonstrate a 14× speedup in tokenization, an 8× reduction in host-to-device data transfer volume (compared to int64 representations), and significantly faster language model training convergence.

πŸ“ Abstract
We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021; Pagnoni et al., 2025), our implementation never introduces out-of-range IDs (i.e. there is no token ID 256) or auxiliary tokens: all special behavior (e.g., padding, boundaries, conversation structure, attention segments, tool calling, "thinking" spans, etc.) is encoded using C0 control bytes - just as ASCII was originally designed to embed control information alongside printable text. These design principles yield practical benefits: (1) faster tokenization (14x) and significantly lower host-device transfer (8x less than int64); (2) simple, shareable 256*d embedding tables that can be aligned across models; and (3) a training-time enhancement via bit-biased embeddings, which exposes per-byte bit structure and can be added to the embedding table post-training, removing inference costs. Our HuggingFace-compatible implementation improves language modeling convergence.
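The core mapping described in the abstract can be sketched in a few lines of Python. The helper names and the specific C0 control-byte assignments below (STX for BOS, ETX for EOS, NUL for padding) are illustrative assumptions for this sketch, not the paper's actual implementation:

```python
# Sketch of a purely byte-level tokenizer: every token ID is the raw UTF-8
# byte value in [0, 255], and special tokens reuse C0 control bytes instead
# of out-of-range IDs. The assignments below are assumptions for this sketch.
PAD, BOS, EOS = 0x00, 0x02, 0x03   # NUL, STX, ETX
SPECIALS = {PAD, BOS, EOS}

def encode(text: str, add_special: bool = True) -> list[int]:
    ids = list(text.encode("utf-8"))          # byte x09 -> token ID 9, etc.
    return [BOS] + ids + [EOS] if add_special else ids

def decode(ids: list[int]) -> str:
    data = bytes(i for i in ids if i not in SPECIALS)
    return data.decode("utf-8", errors="replace")

print(encode("Hi"))            # [2, 72, 105, 3]
print(decode(encode("café")))  # café
```

Because the vocabulary is exactly the 256 byte values, the embedding table is a fixed 256×d matrix that can be shared and aligned across models, as the abstract notes.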
Problem

Research questions and friction points this paper is trying to address.

Prior byte-level tokenizers emit out-of-range token IDs (e.g., ID 256) or auxiliary special tokens
Out-of-bound IDs and extra tokens hurt inference efficiency and cross-model compatibility
Special behaviors (padding, boundaries, conversation structure) lack a uniform in-range byte encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

UTF8Tokenizer maps text directly to UTF-8 byte IDs in [0, 255], with no out-of-range or auxiliary tokens
It encodes special behaviors (BOS, EOS, padding) via repurposed C0 control bytes
It adds bit-biased embeddings during training, which can be folded into the embedding table afterwards at no inference cost
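The bit-biased embedding idea can be sketched as follows, with NumPy and random weights standing in for a trained model (the dimensions and names are illustrative assumptions). Each byte's embedding is its table row plus a learned bias vector for every set bit; since each byte's bit pattern is fixed, the biases can be folded into the table once training ends:

```python
import numpy as np

d = 16                                  # embedding dim (illustrative)
rng = np.random.default_rng(0)
E = rng.normal(size=(256, d))           # standard 256 x d byte embedding table
B = rng.normal(size=(8, d))             # one learned bias vector per bit position

# bits[b, k] = k-th bit of byte b, so each byte selects a fixed subset of biases
bits = ((np.arange(256)[:, None] >> np.arange(8)) & 1).astype(E.dtype)

def embed_train(byte_ids):
    """Training-time lookup: table row plus the byte's bit-structure biases."""
    return E[byte_ids] + bits[byte_ids] @ B

# Post-training, fold the biases into the table once; inference then needs
# only a plain 256 x d lookup, so the enhancement has zero inference cost.
E_folded = E + bits @ B
```

The fold works because `bits` is a constant 256×8 matrix: `E[b] + bits[b] @ B` is the same affine function of `b` before and after training, so precomputing it loses nothing.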