Do Image-Text Metrics Respect Semantic Invariances?

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines whether reference-free image-text evaluation metrics are sensitive to semantics-preserving perturbations. We probe five widely used evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under three kinds of perturbation: spatial edits, changes in object scale and category, and socio-linguistic framing of the caption. Benign edits shift scores by roughly 6–9% on average, and for systems separated by only 0.7% these shifts can flip rankings in up to about 37% of cases, particularly under spatial changes. To reduce this instability, we propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators. A small human study indicates that annotators generally judge the perturbed pairs as equally correct, so the shifts reflect metric behavior rather than semantic change.

📝 Abstract

Reference-free image-to-text evaluators are now standard for scoring image-caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities, where benign spatial edits and simple phrasing changes shift scores by ≈6–9% on average, and for systems separated by just 0.7%, these shifts can cause ranking flips in up to ~37% of cases, particularly under spatial changes. A small human study also supports this finding and confirms that annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.
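
The two quantities the abstract reports, score sensitivity and ranking flips, can be illustrated with a minimal Python sketch. The definitions here are assumptions made for illustration, not the paper's exact formulas: sensitivity is taken as the relative change |s' - s| / |s| in an evaluator's score for one image-caption pair, and a ranking flip is counted whenever the sign of the score gap between two systems changes after perturbation. The function names and toy scores are hypothetical.

```python
from statistics import median


def relative_shifts(original, perturbed):
    """Per-pair relative score change |s' - s| / |s| (assumed definition)."""
    return [abs(p - o) / abs(o) for o, p in zip(original, perturbed)]


def flip_rate(sys_a, sys_b, sys_a_pert, sys_b_pert):
    """Fraction of items where the A-vs-B ordering reverses after perturbation.

    Items on which the two systems tie before perturbation are skipped,
    since they carry no ranking that could flip.
    """
    flips = 0
    total = 0
    for a, b, ap, bp in zip(sys_a, sys_b, sys_a_pert, sys_b_pert):
        before = a - b
        after = ap - bp
        if before == 0:
            continue
        total += 1
        if before * after < 0:  # opposite signs: the ordering reversed
            flips += 1
    return flips / total if total else 0.0


# Toy evaluator scores (hypothetical values, not taken from the paper).
scores_a = [0.80, 0.60, 0.50]
scores_b = [0.79, 0.61, 0.40]
scores_a_perturbed = [0.76, 0.63, 0.50]
scores_b_perturbed = [0.79, 0.60, 0.41]

# Median absolute sensitivity of system A's scores under the perturbation.
median_sensitivity = median(relative_shifts(scores_a, scores_a_perturbed))

# Share of items where the perturbation reverses which system scores higher.
observed_flip_rate = flip_rate(
    scores_a, scores_b, scores_a_perturbed, scores_b_perturbed
)
```

In this toy data the first two items have score gaps of only 0.01 between the systems, so shifts of a few percent are enough to reverse their order, while the third item, with a gap of 0.10, keeps its ranking. This mirrors the abstract's point that small non-semantic shifts matter most when systems are closely matched.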

Problem

Research questions and friction points this paper is trying to address.

semantic invariance

image-text alignment

reference-free evaluation

metric sensitivity

caption evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic invariance

image-text evaluation

invariance probing