🤖 AI Summary
Vision-language models (VLMs) generalize poorly on multi-step visual reasoning tasks and exhibit modality imbalance, with especially weak transfer from simple (SIMPLE) to hard (HARD) task variants. Method: We introduce the synthetic Algorithmic Visual Reasoning (AVR) benchmark, comprising three task families (Table Readout, Grid Navigation, and Visual Analogy), each with two difficulty levels. We evaluate Simple-to-Hard (S2H) generalization, i.e., whether training on SIMPLE tasks improves performance on the corresponding HARD tasks, and use text-only versions of each task to quantify modality imbalance. Through synthetic data generation, training-strategy ablations, and gradient-alignment analysis, we systematically investigate VLM behavior. Contribution/Results: Explicit image-to-text conversion during training significantly improves S2H generalization, while even frontier VLMs struggle on the SIMPLE tasks, revealing fundamental fragility. A gradient-alignment measure appears to identify training strategies that promote better S2H generalization, offering an interpretable signal for VLM assessment.
📝 Abstract
While Vision Language Models (VLMs) are impressive in tasks such as visual question answering (VQA) and image captioning, their ability to apply multi-step reasoning to images has lagged, giving rise to perceptions of modality imbalance or brittleness. Toward a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning (AVR), comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We seek strategies for training on the SIMPLE version of the tasks that improve performance on the corresponding HARD task, i.e., S2H generalization. This synthetic framework, where each task also has a text-only version, allows quantification of the modality imbalance and how it is affected by training strategy. Ablations highlight the importance of explicit image-to-text conversion in promoting S2H generalization when using auto-regressive training. We also report results of a mechanistic study of this phenomenon, including a measure of gradient alignment that seems to identify training strategies that promote better S2H generalization.
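The gradient-alignment idea mentioned in the abstract can be illustrated as cosine similarity between the average parameter gradient computed on SIMPLE examples and the average gradient computed on HARD examples. The sketch below is a minimal, hypothetical formulation (function name, inputs, and the simple averaging scheme are assumptions for illustration; the paper's exact metric may differ):

```python
import numpy as np

def gradient_alignment(grads_simple, grads_hard):
    """Cosine similarity between mean gradients from two task splits.

    grads_simple, grads_hard: arrays of shape (num_examples, num_params),
    each row a flattened per-example parameter gradient. A value near 1
    suggests SIMPLE-task updates also point in a direction that reduces
    HARD-task loss; a value near 0 suggests little transfer.
    (Hypothetical sketch, not the paper's exact definition.)
    """
    g_s = np.mean(np.asarray(grads_simple, dtype=float), axis=0)
    g_h = np.mean(np.asarray(grads_hard, dtype=float), axis=0)
    denom = np.linalg.norm(g_s) * np.linalg.norm(g_h) + 1e-12
    return float(g_s @ g_h / denom)

# Toy usage: identical gradient directions align perfectly,
# orthogonal directions give zero alignment.
aligned = gradient_alignment([[1.0, 0.0]], [[2.0, 0.0]])      # ≈ 1.0
orthogonal = gradient_alignment([[1.0, 0.0]], [[0.0, 1.0]])   # ≈ 0.0
```

In practice the per-example gradients would come from backpropagation through the VLM on each split; the scalar then serves as a diagnostic to compare training strategies rather than as a training objective.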