Overcoming Vocabulary Constraints with Pixel-level Fallback

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient coverage, poor efficiency, and limited generalizability of subword tokenization for low-resource languages and non-Latin scripts, this paper proposes a pixel-level fallback: text is rendered as images, and input embeddings are generated directly from the pixel sequence by a vocabulary-free encoder—bypassing conventional vocabulary constraints entirely. The key contribution is a fallback mechanism that extends monolingual pretrained models to new languages without extensive retraining, in a plug-and-play fashion. Experiments on English-centric language models demonstrate substantial improvements in machine translation and cross-lingual transfer, outperforming byte-level tokenization and vocabulary-expansion baselines, while input compression reduces decoding latency. The method also generalizes markedly better across diverse scripts and resource-scarce languages.
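The fallback idea described above can be sketched in a few lines (an illustrative sketch, not the paper's implementation; `vocab_embed` and `pixel_encode` are hypothetical stand-ins for the pretrained embedding table and the pixel encoder):

```python
# Hypothetical sketch of a pixel-level fallback: tokens covered by the
# pretrained subword vocabulary reuse the existing embedding table, while
# everything else is routed through a vocabulary-free pixel encoder.

def embed_input(token, vocab_embed, pixel_encode):
    """Return an embedding for `token`, falling back to pixels if OOV."""
    if token in vocab_embed:       # in-vocabulary: reuse pretrained embedding
        return vocab_embed[token]
    return pixel_encode(token)     # out-of-vocabulary: render-and-encode path

# Toy demo: a two-entry "vocabulary" and a stand-in pixel encoder that
# averages codepoints (a real encoder would run over rendered pixels).
vocab = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
pixel = lambda t: [sum(map(ord, t)) / len(t), 0.5]

print(embed_input("hello", vocab, pixel))   # pretrained path
print(embed_input("héllo", vocab, pixel))   # pixel fallback path
```

Because the fallback only touches the input side, the pretrained model's weights and output vocabulary stay untouched.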

📝 Abstract
Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.
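As a rough illustration of the pixel pathway and the input compression it enables (a toy sketch under stated assumptions: a fake one-column-per-character "font" and a fixed deterministic projection stand in for a real rasterizer and a trained encoder):

```python
import math

PATCH_WIDTH = 4   # pixel columns per patch: one embedding per patch,
EMBED_DIM = 8     # so the sequence shrinks by roughly PATCH_WIDTH x

def render(text):
    # Toy "font": one 8-pixel column per character, taken from the low
    # bits of the codepoint. A real system rasterizes text with a font,
    # so any script can be handled without a vocabulary.
    return [[(ord(c) >> i) & 1 for i in range(8)] for c in text]

def to_patches(cols, width=PATCH_WIDTH):
    # Group pixel columns into fixed-width patches, zero-padding the last.
    out = []
    for start in range(0, len(cols), width):
        chunk = cols[start:start + width]
        chunk += [[0] * 8] * (width - len(chunk))
        out.append([px for col in chunk for px in col])  # flatten patch
    return out

def embed(patch, dim=EMBED_DIM):
    # Deterministic stand-in for a learned linear projection.
    return [sum(px * math.cos(i * j + 1.0) for j, px in enumerate(patch))
            for i in range(dim)]

text = "vocabulary-free"            # works for any script, e.g. "多语言"
embeddings = [embed(p) for p in to_patches(render(text))]
print(len(text), "chars ->", len(embeddings), "patch embeddings")
```

Grouping several characters' pixels into one patch embedding is what shortens the input sequence, which is the source of the decoding-latency reduction the abstract mentions.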

Problem

Research questions and friction points this paper is trying to address.

Balancing efficiency and coverage in subword tokenization
Improving multilingual performance without extensive retraining
Reducing decoding latency via input compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-level encoder generates text embeddings
Improves multilingual performance without retraining
Reduces decoding latency via compression