From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing visual information extraction benchmarks are limited in scale, realism, and semantic granularity, which makes them inadequate for evaluating multimodal large language models on complex receipt understanding. To address this, this work introduces ReceiptBench, a large-scale, human-annotated benchmark of 10,000 diverse receipts that decomposes receipt understanding into four hierarchical subtasks: basic perception, format normalization, semantic reasoning, and structure parsing. The authors also propose a two-stage training framework built around Metric-Aware Group Relative Policy Optimization (Metric-Aware GRPO), which turns evaluation constraints into reinforcement learning signals to improve structural consistency. The approach achieves state-of-the-art performance and surpasses leading proprietary models on complex reasoning tasks, with particular strength in implicit semantic reasoning and nested structure parsing. The dataset and code are publicly released.
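
To make the four subtasks concrete, here is a minimal sketch of how one toy receipt might map onto them. The field names, the `DD/MM/YYYY` input format, the keyword rule, and the receipt itself are all hypothetical illustrations; none of them are taken from the ReceiptBench schema or annotation guidelines.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LineItem:
    name: str
    quantity: int
    unit_price: float


@dataclass
class ReceiptRecord:
    merchant_raw: str   # (1) perception: text exactly as printed
    date_iso: str       # (2) format normalization: YYYY-MM-DD
    category: str       # (3) semantic reasoning: implied, never printed
    items: list = field(default_factory=list)  # (4) structure parsing: nested line items


def normalize_date(printed: str) -> str:
    """Convert a DD/MM/YYYY string to ISO 8601 (input format is an assumption)."""
    return datetime.strptime(printed, "%d/%m/%Y").date().isoformat()


def infer_category(item_names: list) -> str:
    """Toy keyword rule standing in for the inference an MLLM would perform."""
    food_words = {"latte", "bagel", "sandwich"}
    if any(word in name.lower() for name in item_names for word in food_words):
        return "food_and_beverage"
    return "other"


# A made-up receipt, used only for illustration.
items = [LineItem("Oat Latte", 2, 4.50), LineItem("Bagel", 1, 3.00)]
record = ReceiptRecord(
    merchant_raw="CORNER CAFE #12",
    date_iso=normalize_date("03/02/2024"),
    category=infer_category([item.name for item in items]),
    items=items,
)
total = sum(item.quantity * item.unit_price for item in items)
```

The point of the sketch is only that each subtask produces a different kind of output: copied text, reformatted text, an inferred label, and a nested list.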

📝 Abstract

Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at https://github.com/wwwT0ri/ReceiptBench.
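
The abstract does not spell out the reward design, so the following is only a sketch of the general idea: score each sampled completion with an evaluation metric, then standardize the scores within the group, as GRPO does. The field-level F1 reward, the zero score for unparseable output, and every function name here are assumptions made for illustration, not the authors' implementation.

```python
import json
import statistics


def _pairs(record: dict) -> set:
    # Serialize values so nested lists and dicts are hashable and comparable.
    return {(key, json.dumps(value, sort_keys=True)) for key, value in record.items()}


def field_f1(pred: dict, gold: dict) -> float:
    """F1 over exact-match (key, value) pairs; a stand-in evaluation metric."""
    pred_pairs, gold_pairs = _pairs(pred), _pairs(gold)
    hits = len(pred_pairs & gold_pairs)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_pairs)
    recall = hits / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


def metric_reward(completion: str, gold: dict) -> float:
    """Use the metric as the reward; output that is not a JSON object scores 0."""
    try:
        pred = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred, dict):
        return 0.0
    return field_f1(pred, gold)


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Standardize rewards within one sampled group (the GRPO advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Three hypothetical completions sampled for the same receipt.
gold = {"total": "12.00", "date": "2024-02-03"}
group = [
    '{"total": "12.00", "date": "2024-02-03"}',  # fully correct
    '{"total": "12.00", "date": "03/02/2024"}',  # date not normalized
    "not json",                                  # structurally invalid
]
rewards = [metric_reward(completion, gold) for completion in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the fully correct completion gets a positive advantage, the one with the unnormalized date lands at the group mean, and the invalid output is pushed down. This is one way evaluation constraints could act as a training signal for structural consistency.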

Problem

Research questions and friction points this paper is trying to address.

Visual Information Extraction

Multimodal Large Language Models

Document Understanding

Benchmarking

Receipt Processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReceiptBench

Multimodal Large Language Models

Visual Information Extraction