🤖 AI Summary
Autoregressive language models with subword tokenizers exhibit limited robustness against out-of-vocabulary (OOV) multilingual spelling perturbations. To address this, we propose PixelLM—a novel pixel-based language modeling framework that renders text into character-level grayscale images and performs end-to-end convolutional autoregressive modeling, thereby eliminating reliance on subword embeddings and circumventing OOV issues entirely. PixelLM unifies cross-script representation through pixel-sequence modeling, requiring no language-specific preprocessing or tokenization; during decoding, it reconstructs text from pixels, preserving both generation quality and robustness. Evaluated on multilingual LAMBADA, WMT24, and SST-2 benchmarks under spelling noise, PixelLM achieves significant gains in accuracy and stability. Results demonstrate its strong robustness to diverse multilingual spelling variants and superior generalization across scripts and languages.
📝 Abstract
Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces text-based subword embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA and WMT24 datasets and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
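The core idea — replacing subword lookup with rendered pixel patches, so that any character (even one never seen in training data) still yields a valid input representation — can be sketched in a few lines. The 3×3 bitmap "font" and patch shape below are illustrative assumptions for compactness, not the paper's actual rendering pipeline, which would rasterize text with a real font engine:

```python
# Toy sketch: turn text into a sequence of grayscale pixel patches, the
# representation a pixel-based LM consumes instead of subword token IDs.
# The 3x3 bitmap glyphs are a stand-in for a real font rasterizer (assumption).

TOY_FONT = {
    "a": [[0, 1, 0], [1, 1, 1], [1, 0, 1]],
    "b": [[1, 1, 0], [1, 1, 1], [1, 1, 1]],
    " ": [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
}

# Any unmapped character still renders as *some* patch, so there is no
# out-of-vocabulary failure mode -- the key contrast with subword tokenizers.
FALLBACK_GLYPH = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]

def render(text: str) -> list[list[list[int]]]:
    """Map each character to a 3x3 grayscale patch (values in [0, 1])."""
    return [TOY_FONT.get(ch, FALLBACK_GLYPH) for ch in text]

# A perturbed input like "ab?" (with "?" unseen by the toy font) still
# produces one patch per character, unlike a subword vocabulary lookup.
patches = render("ab?")
```

In a full system these patches would be fed to a convolutional encoder and modeled autoregressively as a pixel sequence; the point of the sketch is only that the mapping from characters to inputs is total, so orthographic perturbations degrade the input gracefully instead of triggering OOV handling.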