🤖 AI Summary
This work proposes a novel approach to Chinese language modeling by directly using grayscale character images as small as 8×8 pixels as input, eschewing discrete character indices and predefined vocabularies. Leveraging a standard language modeling architecture augmented with a lightweight visual encoder, the model harnesses semantic and phonetic cues embedded in the visual structure of Chinese characters. Experiments demonstrate for the first time that extremely low-resolution character images can effectively support language modeling: the model achieves over 12% accuracy within only 0.4% of the total training steps and ultimately reaches 39.2% accuracy—comparable to conventional index-based methods (39.1%). The pronounced “hot-start” effect further underscores the efficacy and potential of visual signals in Chinese language modeling.
📝 Abstract
Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information that may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, with resolutions as low as 8×8 pixels. Remarkably, these inputs achieve 39.2% accuracy, comparable to the index-based baseline of 39.1%. Such low-resolution inputs also exhibit a pronounced hot-start effect: by 0.4% of total training, accuracy already exceeds 12%, while index-based models remain below 6%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.
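To make the input pipeline concrete, the sketch below shows one simple way such a lightweight visual encoder could work: instead of looking up a token ID in an embedding table, an 8×8 grayscale character bitmap is flattened and linearly projected into the decoder's embedding space. This is a minimal illustration under assumed shapes and names (`RES`, `D_MODEL`, `encode_char_image` are hypothetical), not the paper's actual implementation.

```python
import numpy as np

RES, D_MODEL = 8, 64  # 8x8 pixels in, 64-dim embedding out (illustrative sizes)

rng = np.random.default_rng(0)
# Learned projection standing in for a lightweight visual encoder.
W = rng.normal(scale=0.02, size=(RES * RES, D_MODEL))

def encode_char_image(img: np.ndarray) -> np.ndarray:
    """Flatten an 8x8 grayscale bitmap (values in [0, 1]) and project it
    to a decoder-input embedding, replacing an index-based lookup."""
    assert img.shape == (RES, RES)
    return img.reshape(-1) @ W  # shape: (D_MODEL,)

# A toy "glyph": random pixels standing in for a rendered Chinese character.
glyph = rng.random((RES, RES))
emb = encode_char_image(glyph)
print(emb.shape)  # (64,)
```

In a real system the projection would be trained jointly with the language model, and the bitmap would come from rendering each character with a font rasterizer; the point here is only that an 8×8 image carries enough structure to replace the embedding-table index as the model's input.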