Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

๐Ÿ“… 2025-05-29
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Handwritten Mathematical Expression Recognition (HMER) faces persistent challenges in structural modeling and symbol ambiguity due to unconstrained symbol layout and highly variable handwriting styles. This paper presents the first full-parameter fine-tuning of a vision-language model (VLM) for HMER without architectural modification, enabling a unified multi-task learning framework. We propose three synergistic auxiliary tasks: (1) Tree-Aware Chain-of-Thought, which performs structured spatial reasoning over expression trees; (2) Error-Driven Learning, dynamically correcting predictions for visually similar symbols via error feedback; and (3) Symbol Counting, enforcing symbol-level consistency in long expressions. Leveraging data-driven task design and joint optimization, our method achieves new state-of-the-art results on CROHME and HME100Kโ€”outperforming the lightweight specialized model SSAN by 16.31% and surpassing zero-shot Gemini 2.5 Flash by 24.42%.

๐Ÿ“ Abstract
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
Problem

Research questions and friction points this paper is trying to address.

Addressing HMER challenges in OCR due to layout and handwriting variability
Overcoming performance bottlenecks in prior isolated architectural modifications
Leveraging VLMs for unified multi-task HMER solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes VLM without architecture modification
Integrates Tree-CoT for spatial reasoning
Uses EDL to reduce confusion among visually similar symbols, and SC to keep symbol-level counts consistent in long expressions
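The Symbol Counting (SC) task supervises the model with per-symbol counts derived from each ground-truth LaTeX label. The paper's exact tokenization is not reproduced here, so the sketch below uses assumed rules: `TOKEN_RE` and the `STRUCTURAL` exclusion set are illustrative choices, not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical tokenizer: a LaTeX command (\alpha, \sum, ...) or any single
# non-space, non-brace character counts as one visible symbol.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[^\s{}]")

# Commands that shape layout rather than render as standalone symbols;
# this set is illustrative, not taken from the paper.
STRUCTURAL = {r"\frac", r"\sqrt", r"\left", r"\right"}

def symbol_counts(latex: str) -> Counter:
    """Derive per-symbol counts from a LaTeX ground-truth label."""
    tokens = TOKEN_RE.findall(latex)
    return Counter(t for t in tokens if t not in STRUCTURAL)

# Example: the label \frac{a+b}{2} yields one count each for a, +, b, and 2,
# while \frac and the braces are treated as structure, not symbols.
```

Counts like these can serve as an auxiliary prediction target alongside the LaTeX transcription, encouraging the model not to drop or duplicate symbols in long expressions.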
๐Ÿ”Ž Similar Papers
No similar papers found.
Yu Li
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Jin Jiang
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Jianhua Zhu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Shuai Peng
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Baole Wei
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Yuxuan Zhou
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Liangcai Gao
Peking University