🤖 AI Summary
Understanding and reasoning over academic handwritten notes, especially those containing mathematical formulas, diagrams, and scientific symbols, remains a significant challenge for vision-language models (VLMs). Existing VQA benchmarks primarily target printed or structured text and generalize poorly to handwritten academic content. To address this gap, we introduce NoTeS-Bank, a neural transcription and search benchmark designed specifically for academic handwritten note understanding. It defines two task categories: evidence-based VQA with bounding-box localization and open-domain note question answering. The evaluation paradigm is end-to-end, integrating visual grounding, cross-document retrieval, and multi-step reasoning; it bypasses traditional OCR pipelines and emphasizes domain-aware joint vision-language modeling. We jointly evaluate VLMs and retrieval frameworks using multi-dimensional metrics (NDCG@5, MRR, Recall@K, IoU, and ANLS) that assess retrieval effectiveness, spatial localization, and answer accuracy. Extensive experiments reveal critical bottlenecks in current VLMs regarding handwritten formula recognition, evidence localization, and cross-domain retrieval, establishing NoTeS-Bank as a reproducible, extensible benchmark for scientific note understanding.
📝 Abstract
Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing limitations in structured transcription and reasoning. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
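For readers unfamiliar with two of the reported metrics, here is a minimal, illustrative sketch (not the benchmark's official evaluation code) of IoU for bounding-box evidence and ANLS for answer strings; the `[x1, y1, x2, y2]` box format and the 0.5 ANLS threshold are assumptions based on common Document VQA practice.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def levenshtein(a, b):
    """Edit distance via dynamic programming (one-row rolling table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction, gold_answers, tau=0.5):
    """Normalized Levenshtein similarity for one question: best score over
    the gold answers, zeroed when the normalized distance exceeds tau."""
    best = 0.0
    for gold in gold_answers:
        p, g = prediction.strip().lower(), gold.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1 - nl if nl < tau else 0.0)
    return best
```

In practice the benchmark averages ANLS over all questions and compares predicted evidence boxes against annotated ones with an IoU threshold; exact thresholds are defined by the benchmark itself.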