Overcoming Vocabulary Constraints with Pixel-level Fallback

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient coverage, poor efficiency, and limited generalizability of subword tokenization for low-resource languages and non-Latin scripts, this paper proposes a pixel-level fallback: text is rendered as images, and input embeddings are generated directly from the pixel sequence by a vocabulary-free encoder—bypassing conventional vocabulary constraints entirely. The key contribution is a fallback mechanism that extends monolingual pretrained models to new languages without extensive retraining, in a plug-and-play fashion. Experiments on English-centric language models demonstrate substantial improvements in machine translation and cross-lingual transfer, outperforming byte-level tokenization and vocabulary-expansion baselines, while input compression reduces decoding latency. The method also generalizes markedly better across diverse scripts and resource-scarce languages.
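The fallback idea described above can be sketched in a few lines (an illustrative sketch, not the paper's implementation; `vocab_embed` and `pixel_encode` are hypothetical stand-ins for the pretrained embedding table and the pixel encoder):

```python
# Hypothetical sketch of a pixel-level fallback: tokens covered by the
# pretrained subword vocabulary reuse the existing embedding table, while
# everything else is routed through a vocabulary-free pixel encoder.

def embed_input(token, vocab_embed, pixel_encode):
    """Return an embedding for `token`, falling back to pixels if OOV."""
    if token in vocab_embed:       # in-vocabulary: reuse pretrained embedding
        return vocab_embed[token]
    return pixel_encode(token)     # out-of-vocabulary: render-and-encode path

# Toy demo: a two-entry "vocabulary" and a stand-in pixel encoder that
# averages codepoints (a real encoder would run over rendered pixels).
vocab = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
pixel = lambda t: [sum(map(ord, t)) / len(t), 0.5]

print(embed_input("hello", vocab, pixel))   # pretrained path
print(embed_input("héllo", vocab, pixel))   # pixel fallback path
```

Because the fallback only touches the input side, the pretrained model's weights and output vocabulary stay untouched.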

📝 Abstract
Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.
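As a rough illustration of the pixel pathway and the input compression it enables (a toy sketch under stated assumptions: a fake one-column-per-character "font" and a fixed deterministic projection stand in for a real rasterizer and a trained encoder):

```python
import math

PATCH_WIDTH = 4   # pixel columns per patch: one embedding per patch,
EMBED_DIM = 8     # so the sequence shrinks by roughly PATCH_WIDTH x

def render(text):
    # Toy "font": one 8-pixel column per character, taken from the low
    # bits of the codepoint. A real system rasterizes text with a font,
    # so any script can be handled without a vocabulary.
    return [[(ord(c) >> i) & 1 for i in range(8)] for c in text]

def to_patches(cols, width=PATCH_WIDTH):
    # Group pixel columns into fixed-width patches, zero-padding the last.
    out = []
    for start in range(0, len(cols), width):
        chunk = cols[start:start + width]
        chunk += [[0] * 8] * (width - len(chunk))
        out.append([px for col in chunk for px in col])  # flatten patch
    return out

def embed(patch, dim=EMBED_DIM):
    # Deterministic stand-in for a learned linear projection.
    return [sum(px * math.cos(i * j + 1.0) for j, px in enumerate(patch))
            for i in range(dim)]

text = "vocabulary-free"            # works for any script, e.g. "多语言"
embeddings = [embed(p) for p in to_patches(render(text))]
print(len(text), "chars ->", len(embeddings), "patch embeddings")
```

Grouping several characters' pixels into one patch embedding is what shortens the input sequence, which is the source of the decoding-latency reduction the abstract mentions.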

Problem

Research questions and friction points this paper is trying to address.

Balancing efficiency and coverage in subword tokenization
Improving multilingual performance without extensive retraining
Reducing decoding latency via input compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-level encoder generates text embeddings
Improves multilingual performance without retraining
Reduces decoding latency via compression