MathWriting: A Dataset For Handwritten Mathematical Expression Recognition

📅 2024-04-16
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
Low accuracy and scarcity of large-scale, high-quality annotated data hinder handwritten mathematical expression (HME) recognition. To address this, we introduce MathWriting—the largest publicly available online handwritten mathematical formula dataset to date—comprising 230k real touchscreen samples and 400k synthetically generated ones, each paired with high-fidelity stroke images and syntactically normalized LaTeX ground truth. We propose the first LaTeX-structured annotation schema and a standardized preprocessing pipeline, significantly enhancing model robustness. MathWriting supports unified benchmarking across diverse paradigms, including OCR-based, CTC-Transformer, and vision-language models (e.g., PaLI). We provide systematic baseline evaluations of multiple state-of-the-art models on this dataset. The dataset, training code, and interactive Colab notebooks are fully open-sourced, establishing critical infrastructure for advancing HME research and scientific note digitization.

📝 Abstract
Recognizing handwritten mathematical expressions makes it possible to convert scientific notes into digital form, which facilitates the sharing, searching, and preservation of scientific information. We introduce MathWriting, the largest online handwritten mathematical expression (HME) dataset to date. It consists of 230k human-written samples and an additional 400k synthetic ones. The dataset can also be used in its rendered form for offline HME recognition. Each MathWriting sample consists of a formula written on a touch screen and a corresponding LaTeX expression. We also provide a normalized version of each LaTeX expression to simplify the recognition task and improve result quality. We provide baseline performance of standard models such as OCR and CTC Transformer, as well as vision-language models such as PaLI, on the dataset. The dataset, together with an example Colab, is accessible on GitHub.
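To make the idea of a sample and its normalized label concrete, here is a minimal Python sketch. It is not the dataset's file format or the authors' normalization: the `InkSample` class, the `(x, y, t)` point layout, and the three rules in `normalize_latex` are all hypothetical, chosen only to show how normalization can map visually equivalent LaTeX strings onto a single target.

```python
import re
from dataclasses import dataclass

# One pen stroke: (x, y, t) points sampled while the pen touches the screen.
# This point layout is an assumption made for illustration.
Stroke = list[tuple[float, float, float]]


def normalize_latex(latex: str) -> str:
    """Toy normalization with illustrative rules only (not the paper's)."""
    # Drop \left / \right sizing commands; the lookahead keeps commands
    # such as \leftarrow and \rightarrow intact.
    latex = re.sub(r"\\(?:left|right)(?![A-Za-z])", "", latex)
    # Brace single-character sub/superscripts: x^2 -> x^{2}.
    latex = re.sub(r"([_^])\s*([A-Za-z0-9])", r"\1{\2}", latex)
    # Collapse runs of whitespace to a single space.
    return re.sub(r"\s+", " ", latex).strip()


@dataclass
class InkSample:
    """Hypothetical container for one online HME sample."""
    strokes: list[Stroke]  # the handwritten ink
    label: str             # LaTeX as annotated

    @property
    def normalized_label(self) -> str:
        return normalize_latex(self.label)


sample = InkSample(
    strokes=[[(0.0, 0.0, 0.0), (1.0, 1.0, 0.1)]],
    label=r"\left( x^2 \right)",
)
print(sample.normalized_label)
```

Under these toy rules, `\left( x^2 \right)` and `(  x^{2} )` normalize to the same string, so a recognizer is not penalized for choosing one spelling over the other.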
Problem

Research questions and friction points this paper is trying to address.

Recognize handwritten math expressions for digital conversion
Create a large dataset for online HME recognition that can also be rendered for offline use
Improve accuracy with normalized LaTeX expressions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest online handwritten math dataset
Combines human and synthetic samples
Provides normalized LaTeX for recognition
Philippe Gervais
Google Research
Asya Fadeeva
Google Research
Andrii Maksai
Google DeepMind