🤖 AI Summary
This work addresses the inconsistent cross-modal reasoning of vision-language models (VLMs) when the same problem is presented textually versus visually. To this end, we introduce SEAM, the first semantically equivalent, modality-heterogeneous benchmark for evaluating cross-modal alignment, covering four domains with standardized textual and visual notations. Instead of OCR-style image-text pairing, we pair each problem's symbolic notation with a visual rendering of the same content, enabling strictly controlled, consistency-aware evaluation. Our framework systematically reveals two pervasive deficiencies in current VLMs: (1) visual-modality reasoning lags behind linguistic reasoning, and (2) cross-modal output consistency is low; both findings are robust to visual transformations. Experiments across 21 state-of-the-art models show that visual reasoning performance is significantly weaker than language reasoning, with average cross-modal output consistency below 60%. This work establishes a new paradigm for trustworthy VLM evaluation and alignment optimization, providing both methodological innovation and empirical grounding.
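The pairing idea is easiest to see with a concrete domain. Below is a minimal sketch, assuming chess with FEN notation as one illustrative domain (the summary does not name the four domains, so this choice is an assumption); it uses the python-chess library to produce a board image that carries the same information as the FEN string, without any OCR-recoverable text embedded in the image:

```python
# Minimal sketch of a semantically equivalent image-text pair, assuming
# chess/FEN as an illustrative domain (uses the python-chess library).
import chess
import chess.svg

# Textual modality: standard FEN notation for a position
# (here, the position after 1.e4 e5 2.Nf3 Nc6).
fen = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"

# Visual modality: a rendered board encoding the same information
# spatially, via piece glyphs rather than a character string.
board = chess.Board(fen)
svg_image = chess.svg.board(board, size=360)

with open("position.svg", "w") as f:
    f.write(svg_image)

# A VLM is then asked the same question (e.g., "What is the best move?")
# once with `fen` as text input and once with the rendered image, and
# the two final answers are compared for agreement.
```

Because the image presents piece glyphs on a grid rather than the notation string itself, a model cannot shortcut the visual condition by reading text out of the picture; it must actually perceive and reason over the spatial representation.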
📝 Abstract
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains with existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the two presentations carrying semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures caused by the tokenization of domain notations, and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
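For concreteness, here is a hedged sketch of how per-modality accuracy and cross-modal agreement could be computed from paired runs. The record layout, field names, and string normalization are illustrative assumptions, not SEAM's released evaluation code; the key point is that agreement compares the two final answers directly, so two identical wrong answers still count as agreeing:

```python
# Sketch of per-modality accuracy and cross-modal agreement, assuming
# per-item records with the model's final answer under each modality.
# Field names and normalization are illustrative, not SEAM's code.
from dataclasses import dataclass

@dataclass
class ItemResult:
    text_answer: str   # answer when the problem was given as notation text
    image_answer: str  # answer when the same problem was given as an image
    gold: str          # ground-truth answer

def normalize(ans: str) -> str:
    """Crude canonicalization so trivially different strings still match."""
    return ans.strip().lower()

def modality_metrics(results: list[ItemResult]) -> dict[str, float]:
    n = len(results)
    text_acc = sum(normalize(r.text_answer) == normalize(r.gold) for r in results) / n
    image_acc = sum(normalize(r.image_answer) == normalize(r.gold) for r in results) / n
    # Agreement is correctness-agnostic: it asks whether the model gave
    # the same answer under both modalities, right or wrong.
    agreement = sum(
        normalize(r.text_answer) == normalize(r.image_answer) for r in results
    ) / n
    return {"text_acc": text_acc, "image_acc": image_acc, "agreement": agreement}

if __name__ == "__main__":
    demo = [
        ItemResult("Qh5", "Qh5", "Qh5"),  # consistent and correct
        ItemResult("Nf3", "Bc4", "Nf3"),  # text right, vision wrong: disagreement
        ItemResult("O-O", "o-o", "Be2"),  # consistent but both wrong: still agrees
    ]
    print(modality_metrics(demo))
```

Separating accuracy from agreement in this way is what lets the benchmark report an average consistency below 60% independently of how often either modality is actually correct.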