Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing

📅 2025-08-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Document understanding in automated invoice processing faces a key design choice: whether to process native images end-to-end, or to first apply OCR, convert the document to structured Markdown, and then parse the text. Method: This study systematically evaluates the zero-shot information-extraction performance of eight multimodal large language models (MLLMs) from the GPT-5, Gemini 2.5, and Gemma 3 families across three public invoice datasets, comparing the two paradigms under controlled conditions. Contribution/Results: Native image input generally outperforms OCR+Markdown pipelines, but the size of the gap depends critically on model architecture and invoice complexity (e.g., table density, layout heterogeneity). The work reveals systematic architectural divergences across MLLM families in document-structure perception and vision-language alignment, providing the first empirical evidence of such differentiation. The findings yield actionable, reproducible guidance for model selection and pipeline design in document AI systems. The evaluation framework and code are publicly released.
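The two paradigms above can be sketched in a few lines. This is a minimal illustration, not the paper's released code: the prompt-building helpers are hypothetical stand-ins for actual MLLM calls, and only the field-level exact-match scorer (one plausible zero-shot extraction metric; the paper may use a different one) is concrete.

```python
# Sketch of the two invoice-processing paradigms and a simple
# field-level scorer. Model-call plumbing is omitted; the prompt
# builders are illustrative placeholders, not the paper's code.

def score_extraction(predicted: dict, gold: dict) -> float:
    """Exact-match accuracy over the gold invoice fields."""
    if not gold:
        return 0.0
    hits = sum(1 for k, v in gold.items() if predicted.get(k) == v)
    return hits / len(gold)

def build_image_prompt(fields: list[str]) -> str:
    """Paradigm 1: the invoice image goes to the MLLM directly."""
    return ("Extract the following fields from the attached invoice "
            "image and return JSON: " + ", ".join(fields))

def build_markdown_prompt(markdown: str, fields: list[str]) -> str:
    """Paradigm 2: OCR the invoice to Markdown, then parse as text."""
    return ("Extract the following fields from this invoice and "
            "return JSON: " + ", ".join(fields) + "\n\n" + markdown)
```

For example, `score_extraction({"total": "42.00"}, {"total": "42.00", "date": "2025-01-02"})` yields 0.5: one of the two gold fields matches exactly.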

📝 Abstract
This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics. This benchmark provides insights for selecting appropriate models and processing strategies for automated document systems. Our code is available online.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking multi-modal LLMs on invoice processing tasks
Comparing image-based versus text-based document parsing strategies
Evaluating model performance across diverse invoice datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks multi-modal LLMs on invoice datasets
Compares direct image vs structured parsing strategies
Finds that native image processing generally outperforms structured parsing