Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Autoregressive language models with subword tokenizers exhibit limited robustness against out-of-vocabulary (OOV) multilingual spelling perturbations. To address this, we propose PixelLM—a novel pixel-based language modeling framework that renders text into character-level grayscale images and performs end-to-end convolutional autoregressive modeling, thereby eliminating reliance on subword embeddings and circumventing OOV issues entirely. PixelLM unifies cross-script representation through pixel-sequence modeling, requiring no language-specific preprocessing or tokenization; during decoding, it reconstructs text from pixels, preserving both generation quality and robustness. Evaluated on multilingual LAMBADA, WMT24, and SST-2 benchmarks under spelling noise, PixelLM achieves significant gains in accuracy and stability. Results demonstrate its strong robustness to diverse multilingual spelling variants and superior generalization across scripts and languages.
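The core idea, per the summary, is to replace subword embeddings with pixel renderings of text so that any character, from any script or perturbation, maps to a valid input. A minimal illustrative sketch of this rendering step is below; the tiny hand-drawn 5x7 bitmap font, the `render_word` helper, and the fallback glyph are all hypothetical, not the paper's actual renderer.

```python
# Hypothetical sketch: render a word as a grayscale pixel matrix, the kind
# of input a pixel-based LM would consume instead of subword token IDs.
# The 5x7 bitmap font here is illustrative only.
FONT = {
    "A": ["01110", "10001", "10001", "11111", "10001", "10001", "10001"],
    "B": ["11110", "10001", "11110", "10001", "10001", "10001", "11110"],
    # Unseen characters fall back to a filled block: the model still gets a
    # well-formed pixel input, so there is no out-of-vocabulary failure.
    "?": ["11111", "11111", "11111", "11111", "11111", "11111", "11111"],
}

def render_word(word):
    """Render a word as a 7-row grayscale bitmap (0 = white, 255 = black)."""
    rows = [[] for _ in range(7)]
    for ch in word.upper():
        glyph = FONT.get(ch, FONT["?"])
        for r in range(7):
            rows[r].extend(255 * int(bit) for bit in glyph[r])
            rows[r].append(0)  # one-pixel gap between characters
    return rows

pixels = render_word("AB")
print(len(pixels), len(pixels[0]))  # 7 rows, (5+1)*2 = 12 columns
```

An autoregressive model would then treat such pixel matrices (in the paper, via convolutional layers) as its input and output space, decoding generated pixels back into text.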

📝 Abstract
Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
Problem

Research questions and friction points this paper is trying to address.

Addressing vulnerability of language models to orthographic attacks
Overcoming out-of-vocabulary issues in subword tokenizers
Enhancing robustness against multilingual character perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-based generative language model that replaces text-based embeddings
Renders each word as an individual image for representation
Provides robustness to multilingual orthographic attacks