🤖 AI Summary
This work addresses the inconsistent cross-modal reasoning of vision-language models (VLMs) when the same problem is presented textually versus visually. To this end, we introduce SEAM, the first semantically equivalent, modality-heterogeneous benchmark for evaluating cross-modal alignment, covering four domains with standardized textual and visual notations. Instead of OCR-style image-text pairing, we pair each problem's symbolic notation with a visual rendering of the same content, enabling strictly controlled, consistency-aware evaluation. Our framework systematically reveals two pervasive deficiencies in current VLMs: (1) visual-modality reasoning lags behind linguistic reasoning, and (2) cross-modal output consistency is low; both findings are robust to visual transformations. Experiments across 21 state-of-the-art models show that visual reasoning performance is significantly weaker than language reasoning, with average cross-modal output consistency below 60%. This work establishes a new paradigm for trustworthy VLM evaluation and alignment optimization, providing both methodological innovation and empirical grounding.
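The pairing idea is easiest to see with a concrete domain. Below is a minimal sketch, assuming chess with FEN notation as one illustrative domain (the summary does not name the four domains, so this choice is an assumption); it uses the python-chess library to produce a board image that carries the same information as the FEN string, without any OCR-recoverable text embedded in the image:

```python
# Minimal sketch of a semantically equivalent image-text pair, assuming
# chess/FEN as an illustrative domain (uses the python-chess library).
import chess
import chess.svg

# Textual modality: standard FEN notation for a position
# (here, the position after 1.e4 e5 2.Nf3 Nc6).
fen = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"

# Visual modality: a rendered board encoding the same information
# spatially, via piece glyphs rather than a character string.
board = chess.Board(fen)
svg_image = chess.svg.board(board, size=360)

with open("position.svg", "w") as f:
    f.write(svg_image)

# A VLM is then asked the same question (e.g., "What is the best move?")
# once with `fen` as text input and once with the rendered image, and
# the two final answers are compared for agreement.
```

Because the image presents piece glyphs on a grid rather than the notation string itself, a model cannot shortcut the visual condition by reading text out of the picture; it must actually perceive and reason over the spatial representation.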
📝 Abstract
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains with existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the two presentations carrying semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures caused by the tokenization of domain notations, and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
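For concreteness, here is a hedged sketch of how per-modality accuracy and cross-modal agreement could be computed from paired runs. The record layout, field names, and string normalization are illustrative assumptions, not SEAM's released evaluation code; the key point is that agreement compares the two final answers directly, so two identical wrong answers still count as agreeing:

```python
# Sketch of per-modality accuracy and cross-modal agreement, assuming
# per-item records with the model's final answer under each modality.
# Field names and normalization are illustrative, not SEAM's code.
from dataclasses import dataclass

@dataclass
class ItemResult:
    text_answer: str   # answer when the problem was given as notation text
    image_answer: str  # answer when the same problem was given as an image
    gold: str          # ground-truth answer

def normalize(ans: str) -> str:
    """Crude canonicalization so trivially different strings still match."""
    return ans.strip().lower()

def modality_metrics(results: list[ItemResult]) -> dict[str, float]:
    n = len(results)
    text_acc = sum(normalize(r.text_answer) == normalize(r.gold) for r in results) / n
    image_acc = sum(normalize(r.image_answer) == normalize(r.gold) for r in results) / n
    # Agreement is correctness-agnostic: it asks whether the model gave
    # the same answer under both modalities, right or wrong.
    agreement = sum(
        normalize(r.text_answer) == normalize(r.image_answer) for r in results
    ) / n
    return {"text_acc": text_acc, "image_acc": image_acc, "agreement": agreement}

if __name__ == "__main__":
    demo = [
        ItemResult("Qh5", "Qh5", "Qh5"),  # consistent and correct
        ItemResult("Nf3", "Bc4", "Nf3"),  # text right, vision wrong: disagreement
        ItemResult("O-O", "o-o", "Be2"),  # consistent but both wrong: still agrees
    ]
    print(modality_metrics(demo))
```

Separating accuracy from agreement in this way is what lets the benchmark report an average consistency below 60% independently of how often either modality is actually correct.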