🤖 AI Summary
Existing OCR and translation benchmarks (e.g., OCRBench) overlook the evaluation of complex-layout, long-text understanding, particularly for menus. This paper introduces MOTBench, the first joint OCR-and-translation benchmark tailored to menu scenarios, focusing on accurate recognition and translation of dish names, prices, and units in multilingual, multi-font, cross-cultural menu images. The contributions are threefold: (1) a fine-grained structured evaluation framework covering layout comprehension and semantic generation; (2) an automated assessment system integrating field-level accuracy and semantic equivalence, validated against human-annotated bilingual (Chinese/English) reference data; and (3) culturally sensitive annotations over a real-world menu dataset. Experiments show that the automated metrics agree closely with professional human judgments and systematically expose consistent weaknesses of state-of-the-art LVLMs in price localization, unit identification, and culturally nuanced term translation. MOTBench is publicly released.
📝 Abstract
The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, such as the widely used OCRBench, mainly verify the correctness of short-text responses and of long-text responses with simple layouts; the ability to understand long texts with complex layouts is highly significant but largely overlooked. In this paper, we propose the Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework that emphasizes the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish on a menu, along with its price and unit, providing a comprehensive assessment of their visual understanding and language processing capabilities. The benchmark comprises Chinese and English menus characterized by intricate layouts, varied fonts, and culturally specific elements across languages, together with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify strengths and weaknesses, offering valuable insights to guide future LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.