🤖 AI Summary
This work proposes a novel approach to Chinese language modeling by directly using grayscale character images as small as 8×8 pixels as input, eschewing discrete character indices and predefined vocabularies. Leveraging a standard language modeling architecture augmented with a lightweight visual encoder, the model harnesses semantic and phonetic cues embedded in the visual structure of Chinese characters. Experiments demonstrate for the first time that extremely low-resolution character images can effectively support language modeling: the model achieves over 12% accuracy within only 0.4% of the total training steps and ultimately reaches 39.2% accuracy—comparable to conventional index-based methods (39.1%). The pronounced “hot-start” effect further underscores the efficacy and potential of visual signals in Chinese language modeling.
📝 Abstract
Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information that may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, with resolutions as low as 8×8 pixels. Remarkably, these inputs achieve 39.2% accuracy, comparable to the index-based baseline of 39.1%. Such low-resolution inputs also exhibit a pronounced hot-start effect: by 0.4% of total training, accuracy already exceeds 12%, while index-based models remain below 6%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.
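To make the input pipeline concrete, the sketch below shows one simple way such a lightweight visual encoder could work: instead of looking up a token ID in an embedding table, an 8×8 grayscale character bitmap is flattened and linearly projected into the decoder's embedding space. This is a minimal illustration under assumed shapes and names (`RES`, `D_MODEL`, `encode_char_image` are hypothetical), not the paper's actual implementation.

```python
import numpy as np

RES, D_MODEL = 8, 64  # 8x8 pixels in, 64-dim embedding out (illustrative sizes)

rng = np.random.default_rng(0)
# Learned projection standing in for a lightweight visual encoder.
W = rng.normal(scale=0.02, size=(RES * RES, D_MODEL))

def encode_char_image(img: np.ndarray) -> np.ndarray:
    """Flatten an 8x8 grayscale bitmap (values in [0, 1]) and project it
    to a decoder-input embedding, replacing an index-based lookup."""
    assert img.shape == (RES, RES)
    return img.reshape(-1) @ W  # shape: (D_MODEL,)

# A toy "glyph": random pixels standing in for a rendered Chinese character.
glyph = rng.random((RES, RES))
emb = encode_char_image(glyph)
print(emb.shape)  # (64,)
```

In a real system the projection would be trained jointly with the language model, and the bitmap would come from rendering each character with a font rasterizer; the point here is only that an 8×8 image carries enough structure to replace the embedding-table index as the model's input.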