🤖 AI Summary
Existing pixel-based language models lack systematic multilingual pretraining, particularly for joint image-text understanding across diverse writing systems (e.g., Latin, Devanagari, Cyrillic, and Han characters). This work presents the first end-to-end pixel-level multilingual pretraining framework, extending the PIXEL architecture to jointly model images and text in English, Hindi, Ukrainian, and Simplified Chinese. We introduce multilingual mixed-rendered image pretraining and propose word-level probing alongside representation space alignment evaluation. Experiments demonstrate that the model significantly outperforms monolingual baselines on non-Latin script tasks; the four-language latent spaces exhibit strong cross-lingual alignment; it achieves zero-shot transfer to unseen languages; and word-level analysis confirms its capacity to capture rich cross-lingual semantic and morphological features. These results establish a foundation for truly multilingual vision-language modeling at the pixel level.
📝 Abstract
Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.