Where do Large Vision-Language Models Look at when Answering Questions?

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the interpretability challenge in open-ended visual question answering (OVQA) with large vision-language models (LVLMs), specifically focusing on visual dependency and identification of critical image regions. To tackle interpretability bottlenecks arising from multi-encoder architectures, multi-resolution vision encoders, and free-form generative outputs, we propose three innovations: (1) the first heatmap-based visualization method tailored for multi-encoder LVLMs; (2) a vision-relevant token selection mechanism that quantitatively links attention distributions to answer correctness; and (3) an enhanced iGOS++ algorithm integrating gradient-weighted class activation mapping (Grad-CAM) with token-level attribution to enable fine-grained, generation-aware visual explanation. We systematically evaluate leading LVLMs across multiple vision-reasoning benchmarks and find that architectural choices—particularly the number of vision encoders—exert a stronger influence on visual grounding capability than LLM parameter count. Code and data are publicly released.
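The "vision-relevant token selection" idea above can be illustrated with a small sketch: given the attention that each generated answer token places on the context, keep only tokens whose attention mass over image-patch positions is high. This is a hypothetical interface, not the paper's actual implementation — the function name, threshold, and assumption that image tokens occupy the first context positions are all illustrative.

```python
import numpy as np

def select_vision_relevant_tokens(attn, num_image_tokens, threshold=0.2):
    """Pick generated tokens whose attention mass over image tokens
    exceeds `threshold`.

    `attn` is a (num_generated, num_context) array of averaged attention
    weights; the first `num_image_tokens` context positions are assumed
    to be image-patch tokens. (Hypothetical sketch -- the paper's actual
    selection criterion may differ.)
    """
    image_mass = attn[:, :num_image_tokens].sum(axis=1)  # per-token mass on image
    return np.where(image_mass > threshold)[0]

# toy example: 3 generated tokens, 4 image tokens followed by 2 text tokens
attn = np.array([
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],  # mass 0.2 on image -> not selected
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],  # mass 0.9 on image -> selected
    [0.10, 0.10, 0.05, 0.05, 0.35, 0.35],  # mass 0.3 on image -> selected
])
print(select_vision_relevant_tokens(attn, num_image_tokens=4))  # -> [1 2]
```

Selected indices can then be used to restrict the attribution objective to answer tokens that actually depend on the image, filtering out tokens driven purely by language priors.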

📝 Abstract
Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
Problem

Research questions and friction points this paper is trying to address.

Analyze visual attention in Large Vision-Language Models (LVLMs).
Identify image regions influencing LVLM-generated answers.
Evaluate LVLM performance on visual question answering benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends heatmap visualization for LVLMs
Selects visually relevant tokens for answers
Analyzes LVLM behavior on visual benchmarks
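The Grad-CAM component folded into the enhanced iGOS++ algorithm weights each feature channel of a vision encoder by the spatially averaged gradient of the generated token's score, then sums and rectifies. A minimal NumPy sketch (assuming activations and gradients for one image have already been extracted; this is standard Grad-CAM, not the paper's full pipeline):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Standard Grad-CAM over one image.

    `activations` and `gradients` are (channels, H, W) arrays: the
    vision-encoder feature maps and the gradients of the target token's
    score with respect to them (both assumed precomputed).
    """
    weights = gradients.mean(axis=(1, 2))             # (channels,) channel importance
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                          # keep positive evidence only
    if cam.max() > 0:
        cam /= cam.max()                              # normalize to [0, 1]
    return cam

# synthetic activations/gradients just to exercise the function
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (7, 7)
```

In practice the resulting heatmap is upsampled to the input resolution and, per the summary, combined with token-level attribution so the explanation reflects the free-form generated answer rather than a single class logit.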