Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition

📅 2025-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses classification of unseen scripts (e.g., novel alphabets or low-resource languages) in document images without fine-tuning or retraining. The proposed Rosetta model recognizes cross-script symbol sequences from only a few in-context examples. Methodologically, it introduces a Context-Aware Tokenizer (CAT) that enables open-vocabulary classification; builds on Multimodal In-Context Learning (MICL); and uses a synthetic data generation process that varies how informative the context is, so the model learns to exploit context across different scenarios. Experiments on synthetic datasets indicate that Rosetta can classify out-of-distribution visual patterns and a diverse set of alphabets and scripts, including Chinese, Greek, Russian, French, Spanish, and Japanese. Because its classes are not limited to the training alphabet, the model can adapt to new writing systems without script-specific retraining or architectural modification.

📝 Abstract

We introduce Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents from minimal examples, thus eliminating the need for explicit retraining. To enhance contextual learning, we designed a dataset generation process that ensures varying degrees of contextual informativeness, improving the model's adaptability in leveraging context across different scenarios. A key strength of our method is the use of a Context-Aware Tokenizer (CAT), which enables open-vocabulary classification. This allows the model to classify text and symbol patterns across an unlimited range of classes, extending its classification capabilities beyond the scope of its training alphabet of patterns. As a result, it unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate the potential of Rosetta to successfully classify out-of-distribution visual patterns and diverse sets of alphabets and scripts, including but not limited to Chinese, Greek, Russian, French, Spanish, and Japanese.

Problem

Research questions and friction points this paper is trying to address.

Classifying novel script patterns without retraining using minimal examples

Improving in-context learning by generating training data with varying degrees of contextual informativeness

Enabling open-vocabulary classification of text and symbol patterns beyond the training alphabet

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal In-Context Learning for novel scripts

Context-Aware Tokenizer enables open-vocabulary classification

Dataset generation with varying contextual informativeness improves how the model uses context
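To make the open-vocabulary idea concrete, here is a minimal sketch of one way a context-relative label mapping could work. It is an illustration under stated assumptions, not the paper's CAT implementation: the function names (`build_context_vocab`, `encode`, `decode`) are hypothetical, and the assumption is that each symbol is assigned an index by its first appearance in the in-context example labels, so a classifier predicts "which context symbol is this?" rather than an entry in a fixed training vocabulary.

```python
from typing import Dict, List, Sequence


def build_context_vocab(context_labels: Sequence[str]) -> Dict[str, int]:
    """Map each distinct symbol to an index by order of first appearance.

    Hypothetical helper: the mapping is rebuilt for every prompt, so the
    index space depends only on the in-context examples, not on any
    fixed training alphabet.
    """
    vocab: Dict[str, int] = {}
    for label in context_labels:
        for symbol in label:
            if symbol not in vocab:
                vocab[symbol] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a label string to context-relative indices.

    Raises KeyError for a symbol that never appears in the context.
    """
    return [vocab[symbol] for symbol in text]


def decode(indices: Sequence[int], vocab: Dict[str, int]) -> str:
    """Convert predicted context-relative indices back to symbols."""
    inverse = {index: symbol for symbol, index in vocab.items()}
    return "".join(inverse[i] for i in indices)


# Labels of the in-context examples, written in a script the model
# is assumed never to have been trained on.
context = ["αβγ", "γα"]
vocab = build_context_vocab(context)   # {'α': 0, 'β': 1, 'γ': 2}

print(encode("βαγ", vocab))            # [1, 0, 2]
print(decode([2, 2, 0], vocab))        # γγα
```

Under this assumption the number of output classes is bounded by the number of distinct symbols in the context, while the set of scripts that can be handled is open, since any Unicode symbol can occupy an index.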
