🤖 AI Summary
This work investigates how Vision-Language Model (VLM) decoders depend on their two input modalities, vision and language, and how self-consistent they are when generating answers versus natural language explanations (post-hoc or chain-of-thought, CoT). We propose a perturbation-based method to quantify modality importance, showing for the first time that the image modality contributes substantially more to explanation generation than to answer prediction. We further introduce the first cross-modal self-consistency evaluation framework tailored to VLM decoders, extending assessment to logical consistency and input robustness, and extending the VALSE benchmark from VL encoders to decoders. Experiments show that although textual inputs remain dominant for overall performance, visual inputs are markedly more important for explanation generation; most VLM decoders are significantly less self-consistent than pure language models; and they underperform across the diverse reasoning phenomena tested by VALSE.
📝 Abstract
Vision and language model (VLM) decoders are currently the best-performing architectures on multimodal tasks. Besides answers, they can produce natural language explanations, either post-hoc or as chain-of-thought (CoT). However, it is not clear to what extent they use the vision and text input modalities when generating answers or explanations. In this work, we investigate whether VLMs rely on their input modalities differently when producing explanations as opposed to answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings, by extending existing unimodal tests and measures to VLM decoders. We find that most tested VLMs are less self-consistent than LLMs. Across all examined tasks, text contributions in all tested VL decoders are more important than image contributions. However, image contributions are significantly stronger for generating explanations than for generating answers, and this gap is even larger for CoT than for post-hoc explanations. Lastly, we provide an up-to-date benchmarking of state-of-the-art VL decoders on the VALSE benchmark, which was previously restricted to VL encoders. We find that the tested VL decoders still struggle with most phenomena tested by VALSE.
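The perturbation-based idea behind the modality-importance analysis can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: `score_fn`, `perturb_image`, and `perturb_text` are hypothetical stand-ins for a VLM's log-likelihood of its own generated output and for modality-degradation operations (e.g. blacking out the image or shuffling the text). A modality's importance is taken as the relative drop in that score when it is perturbed.

```python
# Hedged sketch of perturbation-based modality importance.
# All names (score_fn, perturb_image, perturb_text) are illustrative
# assumptions, not the paper's API. The idea: a modality matters more
# if perturbing it causes a larger drop in the model's log-likelihood
# of the output it generated (an answer or an explanation).

def modality_importance(score_fn, image, text, output,
                        perturb_image, perturb_text):
    """Return (image_importance, text_importance), normalized to sum to 1.

    score_fn(image, text, output) -> log-likelihood of `output`
    perturb_image / perturb_text  -> degraded versions of their input
    """
    base = score_fn(image, text, output)
    drop_img = base - score_fn(perturb_image(image), text, output)
    drop_txt = base - score_fn(image, perturb_text(text), output)
    total = drop_img + drop_txt
    if total == 0:  # neither perturbation changed the score
        return 0.5, 0.5
    return drop_img / total, drop_txt / total


if __name__ == "__main__":
    # Toy scorer where text carries 70% of the signal: the measured
    # importances recover that split.
    fake = lambda img, txt, out: 0.7 * txt + 0.3 * img
    print(modality_importance(
        fake, image=1.0, text=1.0, output=None,
        perturb_image=lambda i: 0.0,
        perturb_text=lambda t: 0.0))  # → (0.3, 0.7)
```

Comparing the image-importance score between answer generation and explanation generation, as in the paper's findings, would then quantify how much more the visual input matters for explanations.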