🤖 AI Summary
OCR and vision-language models (VLMs) exhibit critical failures in financial document understanding, particularly in accurately recognizing numerical values and dates, while conventional evaluation metrics (e.g., ROUGE) assess only surface-level text similarity and fail to detect high-risk errors such as sign inversions or date offsets. Method: We introduce FinCriticalED, the first fact-level visual benchmark tailored to finance, emphasizing structural sensitivity and zero-tolerance requirements for numeric and temporal facts. It comprises 500 image–HTML pairs, expert-annotated for key financial facts, and integrates an LLM-as-Judge framework for structured fact extraction and contextual validation. Contribution/Results: Our automated evaluation pipeline shifts assessment from lexical overlap to semantic, fact-level rigor. Experiments reveal persistent factual errors, even in state-of-the-art proprietary models, under complex visual conditions, establishing a quantifiable, reproducible, and domain-specific evaluation paradigm for high-precision financial AI.
📝 Abstract
We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision-language models on financial documents at the fact level. Financial documents contain visually dense, table-heavy layouts in which numerical and temporal information is tightly coupled with structure. In high-stakes settings, small OCR mistakes such as sign inversion or shifted dates can lead to materially different interpretations, while traditional OCR metrics like ROUGE and edit distance capture only surface-level text similarity. FinCriticalED provides 500 image-HTML pairs with expert-annotated financial facts, covering over seven hundred numerical and temporal facts. It makes three key contributions. First, it establishes the first fact-level evaluation benchmark for financial document understanding, shifting evaluation from lexical overlap to domain-critical factual correctness. Second, all annotations are created and verified by financial experts under strict quality control over signs, magnitudes, and temporal expressions. Third, we develop an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification for visually complex financial documents. We benchmark OCR systems, open-source vision-language models, and proprietary models on FinCriticalED. Results show that although the strongest proprietary models achieve the highest factual accuracy, substantial errors remain in visually intricate numerical and temporal contexts. Through quantitative evaluation and expert case studies, FinCriticalED provides a rigorous foundation for advancing visual factual precision in financial and other precision-critical domains.
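The core idea of fact-level evaluation, extracting structured numeric and temporal facts and requiring exact matches rather than scoring surface similarity, can be sketched minimally. This is a toy illustration only: a regex extractor stands in for the paper's LLM-as-Judge pipeline, and all function names and the fact formats assumed (signed decimals, ISO dates) are hypothetical, not taken from the benchmark.

```python
import re

# Toy stand-ins for structured fact extraction. FinCriticalED's actual
# pipeline uses an LLM-as-Judge with contextual verification; these
# patterns only illustrate the zero-tolerance matching principle.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
NUM_RE = re.compile(r"-?\d+(?:\.\d+)?%?")

def extract_facts(text: str):
    """Return (dates, numbers). Dates are stripped first so their digits
    are not re-matched as standalone numbers."""
    dates = set(DATE_RE.findall(text))
    remainder = DATE_RE.sub(" ", text)
    numbers = set(NUM_RE.findall(remainder))
    return dates, numbers

def fact_match(reference: str, prediction: str):
    """Zero-tolerance check: every reference fact must appear verbatim
    in the prediction. Returns (all_matched, missing_facts)."""
    ref_dates, ref_nums = extract_facts(reference)
    pred_dates, pred_nums = extract_facts(prediction)
    missing = (ref_dates - pred_dates) | (ref_nums - pred_nums)
    return not missing, missing

# A dropped minus sign barely changes the text (high surface similarity)
# but flips the financial meaning entirely:
ref = "Net loss of -3.2 million reported on 2023-06-30."
ocr = "Net loss of 3.2 million reported on 2023-06-30."
ok, missing = fact_match(ref, ocr)
print(ok, missing)  # False {'-3.2'}
```

A character-level metric such as edit distance would rate the OCR output above as nearly perfect, which is exactly the failure mode the fact-level check is designed to catch.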