🤖 AI Summary
This work addresses the vulnerability of natural language inference (NLI) models to textual biases and superficial heuristics, as well as their reliance on task-specific fine-tuning. We propose a zero-shot multimodal NLI framework that grounds linguistic semantics in visual context: text-to-image generation models synthesize visual representations from premises, which are then aligned with hypothesis texts via cross-modal cosine similarity. By using vision-based grounding as a semantic representation, our approach mitigates the inherent biases of purely textual modeling. To rigorously evaluate robustness, we construct a controllable adversarial dataset; additionally, we integrate a visual question answering mechanism to improve inference consistency. Experiments demonstrate that, without any NLI-specific fine-tuning, our method achieves state-of-the-art zero-shot performance—significantly improving robustness against lexical substitutions, syntactic perturbations, and logical fallacies—while matching the accuracy of fully supervised baselines.
📝 Abstract
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging the visual modality as a meaning representation is a promising direction for robust natural language understanding.
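The cosine-similarity variant of the inference step can be sketched as follows. This is a minimal illustration under assumptions not specified here: the premise is first rendered into an image by a text-to-image model, both the image and the hypothesis are embedded into a shared space (e.g., by a CLIP-style encoder), and a similarity threshold decides entailment. The encoder stubs and the threshold value below are hypothetical placeholders, not details from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_nli(premise_image_emb: np.ndarray,
                  hypothesis_text_emb: np.ndarray,
                  threshold: float = 0.25) -> tuple[str, float]:
    """Decide entailment by thresholding the cross-modal similarity
    between the embedding of the image generated from the premise
    and the embedding of the hypothesis text.

    The threshold is a hypothetical tuning parameter; in practice it
    would be calibrated on a held-out set.
    """
    sim = cosine_similarity(premise_image_emb, hypothesis_text_emb)
    label = "entailment" if sim >= threshold else "non-entailment"
    return label, sim

# Toy embeddings standing in for real encoder outputs.
aligned = np.array([0.9, 0.1, 0.0])      # image and text point the same way
unrelated = np.array([0.0, 0.0, 1.0])    # orthogonal to the premise image

label_pos, _ = zero_shot_nli(aligned, np.array([0.9, 0.1, 0.0]))
label_neg, _ = zero_shot_nli(aligned, unrelated)
```

In a real pipeline, `premise_image_emb` would come from encoding the generated premise image and `hypothesis_text_emb` from encoding the hypothesis with a matched text tower, so the two vectors live in the same joint space.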