From Language Models over Tokens to Language Models over Characters

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 11
Influential: 0
🤖 AI Summary
Modern language models operate at the token level: prompts given as character strings must first be mapped to tokens, outputs are sensitive to prompt formatting, and application interfaces are needlessly complex. This work addresses these limitations by proposing the first framework that rigorously transforms token-level language models into character-level ones, removing the dependence on a predefined tokenizer at the interface. Two conversion algorithms are designed: (i) an exact method that reconstructs the character-level probability distribution via dynamic programming, and (ii) an efficient approximation method incorporating entropy-regularized low-rank approximation. Evaluated on Llama 3.1 8B, the approach achieves 46.3 characters/second of inference throughput with only 0.00021 excess bits/character of distortion, nearly matching the exact character-level distribution. This shift beyond conventional token-based interfaces significantly enhances input robustness and API simplicity while preserving model fidelity.
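The core idea behind the exact conversion is that a character string can be spelled by many different token sequences, so its character-level probability is the sum over all tokenizations. The sketch below is a minimal illustration of that marginalization with a dynamic program; the toy unigram vocabulary `VOCAB` and the function name are hypothetical, and this is not the paper's algorithm (which handles autoregressive conditional distributions) nor Llama's tokenizer.

```python
# Illustrative sketch only: a toy unigram token model (VOCAB is made up,
# not from the paper) converted to a character-prefix probability by
# summing over every tokenization of the prefix.

VOCAB = {"a": 0.3, "b": 0.2, "ab": 0.4, "<eos>": 0.1}  # unigram token probs

def char_prefix_prob(chars: str, vocab: dict[str, float]) -> float:
    """P(the generated character string starts with `chars`)."""
    n = len(chars)
    # f[j] = total probability of token sequences that spell chars[:j] exactly
    f = [0.0] * (n + 1)
    f[0] = 1.0
    for j in range(1, n + 1):
        for i in range(j):
            tok = chars[i:j]
            if tok in vocab:
                f[j] += f[i] * vocab[tok]
    # Also count a final token that overshoots past the end of `chars`
    # (e.g. token "ab" covering a prefix that ends after "a").
    over = sum(
        f[i] * p
        for i in range(n)
        for tok, p in vocab.items()
        if tok != "<eos>" and len(tok) > n - i and tok.startswith(chars[i:])
    )
    return f[n] + over

# "ab" can be spelled as ["ab"] (0.4) or ["a", "b"] (0.3 * 0.2 = 0.06).
print(char_prefix_prob("ab", VOCAB))  # → 0.46
```

A real token-level model conditions each token on its predecessors, so the DP must carry model states rather than scalar probabilities, which is where the paper's exact and budget-limited approximate algorithms come in.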

📝 Abstract
Modern language models are internally -- and mathematically -- distributions over token strings rather than *character* strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent analyses are very sensitive to the specification of the prompt (e.g., if the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. We find that -- even with a small computation budget -- our method is able to accurately approximate the character-level distribution (less than 0.00021 excess bits / character) at reasonably fast speeds (46.3 characters / second) on the Llama 3.1 8B language model.
Problem

Research questions and friction points this paper is trying to address.

Convert token-level language models to character-level ones
Address sensitivity of tokenizers to prompt specifications
Improve compression rate of language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convert token-level models to character-level
Exact and approximate algorithms presented
Improve compression rate and runtime efficiency