🤖 AI Summary
This paper addresses the classification of unseen scripts (e.g., novel alphabets or low-resource languages) in document images without fine-tuning or retraining. The proposed Rosetta model recognizes cross-script symbol sequences from only a few in-context examples. Methodologically, it introduces a Context-Aware Tokenizer (CAT) that enables open-vocabulary classification; builds on Multimodal In-Context Learning (MICL); and uses a synthetic data generation process that controls how informative the context is, so that the model learns to exploit context across different scenarios. Experiments on multilingual synthetic benchmarks show that Rosetta can classify out-of-distribution visual patterns as well as alphabets and scripts outside its training set, including Chinese, Greek, Russian, French, Spanish, and Japanese. The model thus adapts to new writing systems without script-specific supervision or architectural modification.
📝 Abstract
We introduce Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents from a minimal set of examples, eliminating the need for explicit retraining. To strengthen in-context learning, we design a dataset generation process that varies how informative the context is, which improves the model's ability to exploit context across different scenarios. A key strength of our method is the Context-Aware Tokenizer (CAT), which enables open-vocabulary classification: the model can classify text and symbol patterns across an unlimited range of classes, beyond the alphabet of patterns seen during training. This unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate Rosetta's potential to classify out-of-distribution visual patterns and diverse alphabets and scripts, including but not limited to Chinese, Greek, Russian, French, Spanish, and Japanese.
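To make the idea of open-vocabulary, in-context classification concrete, here is a minimal Python sketch. It is not Rosetta's architecture and not the Context-Aware Tokenizer: as a stand-in for the learned model, it assigns each query symbol the label of its most similar context example, so the set of classes is defined entirely by the context supplied at inference time. The function names, the cosine-similarity rule, and the toy feature vectors are all assumptions made for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_in_context(context, queries):
    """Label each query with the label of its most similar context example.

    context: list of (feature_vector, label) pairs given at inference time.
    queries: list of feature vectors to classify.
    Labels are arbitrary strings taken from the context, so no fixed
    training vocabulary is needed (hypothetical stand-in for a learned model).
    """
    if not context:
        raise ValueError("at least one in-context example is required")
    return [
        max(context, key=lambda example: cosine(query, example[0]))[1]
        for query in queries
    ]


# Toy vectors standing in for visual embeddings of glyph crops (made-up values).
context = [
    ([1.0, 0.0, 0.1], "α"),
    ([0.0, 1.0, 0.1], "β"),
    ([0.1, 0.1, 1.0], "γ"),
]
queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.8, 0.2]]

predicted = classify_in_context(context, queries)
print(predicted)  # ['α', 'γ', 'β']
```

Swapping the context for examples from a different script changes the label set without touching any parameters, which is the property the abstract attributes to open-vocabulary classification. A learned model would replace the fixed similarity rule with behavior acquired during training.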