🤖 AI Summary
This study examines the tendency of vision-language models (VLMs) to rely on linguistic priors when performing OCR on Ancient Greek texts, producing plausible output that the image does not support. Using low-resource Ancient Greek critical editions, the authors compare open-weight VLMs with traditional OCR systems through controlled image perturbations, analysis of decoding distributions, and decode-time interventions. They propose token-level visual grounding measures that compare image-conditioned and image-free decoding distributions. These measures show that reliance on priors is model-specific: an OCR-specialist model produces fluent lexical errors with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Fluent output therefore does not necessarily reflect visual evidence; decode-time interventions fail to reliably restore grounding, and post-OCR language-model correction helps several systems only by repairing text after generation. The results motivate moving beyond aggregate accuracy toward interpretability-driven evaluation of visual grounding.
📝 Abstract
Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
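The abstract does not spell out how the token-level grounding measures are computed, so the following is only an illustrative sketch of one way to compare image-conditioned and image-free decoding distributions. It assumes that, for each generated token, next-token logits are available from two forward passes over the same prefix: one with the page image and one without. The function names (`log_softmax`, `kl_divergence`, `token_grounding`) and the two scores (a per-token log-probability ratio and a KL divergence) are hypothetical choices for illustration, not the authors' definitions.

```python
import math
from typing import Dict, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(log_p: Sequence[float], log_q: Sequence[float]) -> float:
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def token_grounding(
    cond_logits: Sequence[Sequence[float]],
    prior_logits: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> List[Dict[str, float]]:
    """Score each generated token by how much the image changed its prediction.

    cond_logits[t]  : logits at step t with the image in context (assumed input)
    prior_logits[t] : logits at step t for the same prefix without the image
    token_ids[t]    : the token actually emitted at step t

    A log_ratio near 0 with a small KL suggests the token would have been
    produced from the language prior alone; large values suggest the image
    shifted the prediction.
    """
    scores = []
    for cond, prior, tok in zip(cond_logits, prior_logits, token_ids):
        log_p = log_softmax(cond)   # log p(y_t | image, y_<t)
        log_q = log_softmax(prior)  # log p(y_t | y_<t)
        scores.append(
            {
                "log_ratio": log_p[tok] - log_q[tok],
                "kl": kl_divergence(log_p, log_q),
            }
        )
    return scores


# Toy example with a 3-token vocabulary (made-up numbers, not real model output).
example = token_grounding(
    cond_logits=[[2.0, 0.0, 0.0], [1.0, 3.0, 0.0]],
    prior_logits=[[0.0, 0.0, 0.0], [1.0, 3.0, 0.0]],
    token_ids=[0, 1],
)
for step, score in enumerate(example):
    print(step, score)
```

In the toy example, the first token gets a positive log ratio because the image-conditioned pass raises its probability, while the second token scores zero on both measures because the two passes agree exactly. Under this reading, a fluent but wrong token with near-zero scores would be the kind of prior-driven error the abstract attributes to the OCR-specialist model.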