🤖 AI Summary
This work addresses the vulnerability of natural language inference (NLI) models to textual biases and superficial heuristics, as well as their reliance on task-specific fine-tuning. We propose a zero-shot multimodal NLI framework that grounds linguistic semantics in visual context: text-to-image generation models synthesize visual representations from premises, which are then aligned with hypothesis texts via cross-modal cosine similarity. By using vision-based grounding as a semantic representation, our approach mitigates the inherent biases of purely textual modeling. To rigorously evaluate robustness, we construct a controllable adversarial dataset; additionally, we integrate a visual question answering mechanism to improve inference consistency. Experiments demonstrate that, without any NLI-specific fine-tuning, our method achieves state-of-the-art zero-shot performance—significantly improving robustness against lexical substitutions, syntactic perturbations, and logical fallacies—while matching the accuracy of fully supervised baselines.
📝 Abstract
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging the visual modality as a meaning representation is a promising direction for robust natural language understanding.
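The cosine-similarity variant of the inference step can be sketched as follows. This is a minimal illustration under assumptions not specified here: the premise is first rendered into an image by a text-to-image model, both the image and the hypothesis are embedded into a shared space (e.g., by a CLIP-style encoder), and a similarity threshold decides entailment. The encoder stubs and the threshold value below are hypothetical placeholders, not details from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_nli(premise_image_emb: np.ndarray,
                  hypothesis_text_emb: np.ndarray,
                  threshold: float = 0.25) -> tuple[str, float]:
    """Decide entailment by thresholding the cross-modal similarity
    between the embedding of the image generated from the premise
    and the embedding of the hypothesis text.

    The threshold is a hypothetical tuning parameter; in practice it
    would be calibrated on a held-out set.
    """
    sim = cosine_similarity(premise_image_emb, hypothesis_text_emb)
    label = "entailment" if sim >= threshold else "non-entailment"
    return label, sim

# Toy embeddings standing in for real encoder outputs.
aligned = np.array([0.9, 0.1, 0.0])      # image and text point the same way
unrelated = np.array([0.0, 0.0, 1.0])    # orthogonal to the premise image

label_pos, _ = zero_shot_nli(aligned, np.array([0.9, 0.1, 0.0]))
label_neg, _ = zero_shot_nli(aligned, unrelated)
```

In a real pipeline, `premise_image_emb` would come from encoding the generated premise image and `hypothesis_text_emb` from encoding the hypothesis with a matched text tower, so the two vectors live in the same joint space.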