Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition

📅 2025-04-09
🤖 AI Summary
This paper addresses classification of unseen scripts (e.g., novel alphabets or low-resource languages) in document images without fine-tuning or retraining. The proposed Rosetta model recognizes cross-script symbol sequences from only a few in-context examples. Methodologically, it introduces a Context-Aware Tokenizer (CAT) that enables open-vocabulary classification, integrates Multimodal In-Context Learning (MICL), and uses a synthetic data generation process that varies how informative the context is, so that the model learns to exploit context across different scenarios. Experiments on multilingual synthetic benchmarks indicate that Rosetta can classify out-of-distribution visual patterns and scripts outside its training alphabet, including Chinese, Greek, Russian, French, Spanish, and Japanese. This improves adaptability to new writing systems without script-specific supervision or architectural modification.

📝 Abstract
We introduce Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents from a minimal number of examples, thus eliminating the need for explicit retraining. To enhance contextual learning, we designed a dataset generation process that ensures varying degrees of contextual informativeness, improving the model's ability to leverage context across different scenarios. A key strength of our method is the use of a Context-Aware Tokenizer (CAT), which enables open-vocabulary classification. This allows the model to classify text and symbol patterns across an unlimited range of classes, extending its classification capabilities beyond the scope of its training alphabet of patterns. As a result, it unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate the potential of Rosetta to successfully classify out-of-distribution visual patterns and diverse sets of alphabets and scripts, including but not limited to Chinese, Greek, Russian, French, Spanish, and Japanese.

Problem

Research questions and friction points this paper is trying to address.

Classifying novel script patterns without retraining using minimal examples
Enhancing contextual learning with varied dataset generation for adaptability
Enabling open-vocabulary classification for unlimited text and symbol patterns

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal In-Context Learning for novel scripts
Context-Aware Tokenizer enables open-vocabulary classification
Dataset generation enhances contextual adaptability
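
The open-vocabulary idea in the points above can be illustrated with a toy sketch. It is not Rosetta's architecture or its Context-Aware Tokenizer, whose details this page does not give. It only assumes that each symbol is already represented by a feature vector (a stand-in for a visual embedding) and that a prediction is an index into the in-context examples rather than an entry in a fixed class list. The function names `cosine`, `predict_context_indices`, and `decode` are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_context_indices(context, queries):
    """For each query vector, return the index of the most similar context example.

    `context` is a list of (feature_vector, label) pairs. The prediction is a
    position in the context, so the label set is not fixed in advance.
    """
    return [
        max(range(len(context)), key=lambda i: cosine(query, context[i][0]))
        for query in queries
    ]


def decode(context, indices):
    """Map predicted context positions back to whatever labels the context carries."""
    return "".join(context[i][1] for i in indices)


# Assumed toy data: three labelled in-context examples and three unlabelled queries.
context = [
    ([1.0, 0.0, 0.1], "α"),
    ([0.0, 1.0, 0.1], "β"),
    ([0.1, 0.1, 1.0], "γ"),
]
queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.8, 0.2]]

indices = predict_context_indices(context, queries)
print(indices, decode(context, indices))
```

Because the output is a position in the context, swapping the labels in `context` for symbols from another alphabet changes the decoded sequence without touching the classifier, which is the sense in which such a scheme is open-vocabulary.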
Authors
Tom Simon
LITIS EA4108, University of Rouen Normandy, France
William Mocaer
LITIS EA4108, University of Rouen Normandy, France
Pierrick Tranouez
Litis, University of Rouen Normandie
Agent-based modeling, mobility modeling, digital humanities, document image analysis, machine learning
C. Chatelain
LITIS EA4108, INSA of Rouen Normandy, France
Thierry Paquet
University of Rouen Normandy, LITIS
Machine Learning, Handwriting Recognition, Reading Systems, Document Image Analysis