MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent Prompts

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models exhibit insufficient semantic consistency across syntactically distinct but logically equivalent prompts, revealing fundamental weaknesses in logical reasoning robustness. To address this, we propose MetaLogic—the first unsupervised, scalable T2I robustness evaluation framework grounded in metamorphic testing. It automatically generates semantically equivalent prompt pairs and detects fine-grained logical inconsistencies in generated images without requiring ground-truth references. Crucially, MetaLogic formalizes logical consistency as a cross-prompt alignment problem and systematically categorizes failure modes—including entity omission, duplication, and positional misalignment. Extensive evaluation on state-of-the-art models reveals severe deficiencies: Flux.dev and DALL·E 3 exhibit logical misalignment rates of 59% and 71%, respectively—defects entirely overlooked by conventional evaluation metrics. MetaLogic thus uncovers previously undetected, deep-seated semantic flaws in contemporary T2I systems.

📝 Abstract
Recent advances in text-to-image (T2I) models, especially diffusion-based architectures, have significantly improved the visual quality of generated images. However, these models continue to struggle with a critical limitation: maintaining semantic consistency when input prompts undergo minor linguistic variations. Despite being logically equivalent, such prompt pairs often yield misaligned or semantically inconsistent images, exposing a lack of robustness in reasoning and generalisation. To address this, we propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images. MetaLogic leverages metamorphic testing, generating image pairs from prompts that differ grammatically but are semantically identical. By directly comparing these image pairs, the framework identifies inconsistencies that signal failures in preserving the intended meaning, effectively diagnosing robustness issues in the model's logic understanding. Unlike existing evaluation methods that compare a generated image to a single prompt, MetaLogic evaluates semantic equivalence between paired images, offering a scalable, ground-truth-free approach to identifying alignment failures. It categorises these alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples that can be used for model debugging and refinement. We evaluate MetaLogic across multiple state-of-the-art T2I models and reveal consistent robustness failures across a range of logical constructs. We find that even the SOTA text-to-image models like Flux.dev and DALLE-3 demonstrate a 59 percent and 71 percent misalignment rate, respectively. Our results show that MetaLogic is not only efficient and scalable, but also effective in uncovering fine-grained logical inconsistencies that are overlooked by existing evaluation metrics.
Problem

Research questions and friction points this paper is trying to address.

Evaluates text-to-image model robustness to logically equivalent prompts
Detects semantic inconsistencies without ground truth image comparisons
Identifies alignment failures like entity omission and positional misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies metamorphic testing: generates prompt pairs that differ grammatically but are semantically identical
Requires no ground-truth reference images, making evaluation scalable and unsupervised
Detects alignment failures (entity omission, duplication, positional misalignment) by directly comparing paired images
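The metamorphic-testing loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_image` and `image_similarity` are hypothetical stand-ins for a real T2I model and an image-comparison metric (e.g. a CLIP-based score), and the `threshold` value is an assumed parameter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MisalignmentReport:
    """Result of one metamorphic check on a prompt pair."""
    prompt_a: str
    prompt_b: str
    similarity: float
    consistent: bool


def metamorphic_check(
    prompt: str,
    rewrite: Callable[[str], str],          # logic-preserving prompt transformation
    generate_image: Callable[[str], object],  # stand-in for a T2I model
    image_similarity: Callable[[object, object], float],  # stand-in metric in [0, 1]
    threshold: float = 0.9,                 # assumed consistency cutoff
) -> MisalignmentReport:
    """Generate images from a prompt and its logically equivalent rewrite,
    then flag the pair as misaligned if the images diverge semantically.
    No ground-truth reference image is needed: the two generated images
    serve as each other's oracle."""
    variant = rewrite(prompt)
    img_a = generate_image(prompt)
    img_b = generate_image(variant)
    sim = image_similarity(img_a, img_b)
    return MisalignmentReport(prompt, variant, sim, sim >= threshold)
```

A counterexample surfaces whenever `consistent` is `False`: the model produced semantically different images for two prompts that mean the same thing, which is exactly the robustness failure MetaLogic is designed to catch.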