Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

πŸ“… 2025-04-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current vision-language models (VLMs) show modest average gains on OCR tasks yet suffer from unreliable sample-level output quality and lack unsupervised, training-free confidence estimation mechanisms. Method: We propose Consensus Entropy, a training-free, fine-tuning-free, plug-and-play post-processing method that quantifies uncertainty, without introducing any parameters, by modeling output-space consistency across multiple VLMs. It builds on the empirical observation that VLM predictions converge for correct outputs and diverge for errors, enabling self-verification via consensus entropy computation and supporting self-optimized result fusion. Contribution/Results: On multiple OCR benchmarks, the method improves F1 score by 15.2% over VLM-as-judge baselines and raises mathematical calculation accuracy by 6.0% while rephrasing only 7.3% of inputs. It significantly enhances OCR robustness and output reliability without requiring model retraining or architectural modification.

πŸ“ Abstract
The Optical Character Recognition (OCR) task is important both for evaluating Vision-Language Models (VLMs) and for providing high-quality data for LLM training. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs, and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics. For instance, our solution achieves 15.2% higher F1 scores than VLM-as-judge methods in quality verification, delivers 6.0% accuracy gains on mathematical calculation tasks, and requires rephrasing only 7.3% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision and remains plug-and-play throughout.
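The abstract's core mechanic, agreement-based uncertainty plus best-output selection, can be sketched in a few lines. The paper does not publish its exact distance measure here, so this illustration assumes a normalized string distance (via Python's stdlib `difflib.SequenceMatcher`) as the divergence metric; the function names `consensus_entropy` and `select_consensus_output` are hypothetical labels for this sketch, not the authors' API.

```python
from difflib import SequenceMatcher

def pairwise_distance(a, b):
    """Normalized string distance in [0, 1]; 0 means identical outputs."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def consensus_entropy(outputs):
    """Mean pairwise distance across the VLM outputs for one sample.

    Low values -> models converge (likely correct, per the paper's insight);
    high values -> models diverge (likely erroneous).
    """
    n = len(outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(pairwise_distance(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)

def select_consensus_output(outputs):
    """Pick the output closest on average to all others (the 'centroid')."""
    def avg_dist(i):
        return sum(pairwise_distance(outputs[i], outputs[j])
                   for j in range(len(outputs)) if j != i)
    return outputs[min(range(len(outputs)), key=avg_dist)]
```

Under this reading, the score doubles as a quality-verification signal (threshold it to flag bad samples) and as a fusion rule (return the centroid output), which matches the framework's stated roles.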
Problem

Research questions and friction points this paper is trying to address.

Improving OCR accuracy in Vision-Language Models
Detecting low-quality OCR outputs automatically
Combining multiple VLMs for better OCR performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Consensus Entropy for OCR uncertainty quantification
Aggregates outputs from multiple VLMs for quality verification
Lightweight multi-model framework for optimal output selection
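The framework's verification step, implied by the figure that only 7.3% of inputs need rephrasing, amounts to routing samples by their agreement score. A minimal, self-contained sketch of that routing, again assuming a `difflib`-based distance and a hypothetical threshold value (the paper's actual cutoff is not stated here):

```python
from difflib import SequenceMatcher

def consensus_entropy(outputs):
    """Mean pairwise normalized string distance across one sample's VLM outputs."""
    n = len(outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1.0 - SequenceMatcher(None, outputs[i], outputs[j]).ratio()
               for i, j in pairs) / len(pairs)

def route_samples(batch, threshold=0.3):
    """Split a batch (list of per-sample output lists) into accepted vs. flagged.

    `threshold` is an assumed illustrative value, not from the paper:
    samples whose outputs diverge beyond it are flagged for re-processing.
    """
    accepted, flagged = [], []
    for outputs in batch:
        (flagged if consensus_entropy(outputs) > threshold else accepted).append(outputs)
    return accepted, flagged
```

In this sketch, accepted samples keep their consensus output directly, while flagged ones would be sent back for the rephrase-and-retry pass the abstract alludes to.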
πŸ”Ž Similar Papers
No similar papers found.
Yulong Zhang
Google
Security and Privacy
Tianyi Liang
PhD, East China Normal University; Shanghai AI Lab; Shanghai Innovation Institute
Multimodal Learning · LLMs · Image Editing
Xinyue Huang
Sun Yat-sen University
Erfei Cui
Shanghai AI Laboratory; Shanghai Jiao Tong University
Computer Vision
Xu Guo
Shanghai Artificial Intelligence Laboratory
Pei Chu
Shanghai Artificial Intelligence Laboratory
Chenhui Li
Baidu
AI · NLP · CV
Ru Zhang
Beijing University of Posts and Telecommunications, China
Wenhai Wang
Shanghai Artificial Intelligence Laboratory
Gongshen Liu
Shanghai Jiao Tong University