KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

📅 2024-07-25

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

Existing visual reasoning benchmarks for large multimodal models (LMMs) require advanced skills and omit the basic visual analogies that even young children can make. Method: The paper introduces KiVA (Kid-inspired Visual Analogies), a benchmark of 4,300 visual transformations of everyday objects, inspired by developmental psychology. It evaluates LMMs in three stages: identifying what changed, specifying how it changed, and applying the rule to new scenarios. Model performance is compared with that of children (ages three to five) and adults. Results: The tested models (GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS) identify what changed effectively but struggle to quantify how it changed and to extrapolate the rule to new objects, whereas children and adults show much stronger analogical reasoning at all three stages. GPT-o1, the strongest tested model, performs better on simple surface-level attributes such as color and size than on number, rotation, and reflection, which points to the limitations of training on data that consists primarily of 2D images and text.

📝 Abstract

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children (ages three to five) and to adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better in tasks involving simple surface-level visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of extrinsic spatial properties in the physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.
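The three-stage evaluation described in the abstract can be pictured as a per-trial record with one outcome per stage, aggregated by transformation domain. The sketch below is only an illustration under stated assumptions: the names `TrialResult`, `stage_accuracy`, and the stage labels are hypothetical, the toy outcomes are invented for the example and are not results from the paper, and the paper's actual scoring procedure may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for the three stages named in the abstract:
# what changed, how it changed, and applying the rule to a new object.
STAGES = ("what", "how", "apply")


@dataclass(frozen=True)
class TrialResult:
    """Outcome of one trial (hypothetical record layout, not the paper's format)."""
    domain: str   # e.g. "color", "size", "number", "rotation", "reflection"
    what: bool    # identified which attribute changed
    how: bool     # specified how it changed (e.g. "added one object")
    apply: bool   # applied the inferred rule to a new object


def stage_accuracy(results):
    """Return {domain: {stage: fraction of trials answered correctly}}."""
    correct = defaultdict(lambda: {stage: 0 for stage in STAGES})
    totals = defaultdict(int)
    for result in results:
        totals[result.domain] += 1
        for stage in STAGES:
            correct[result.domain][stage] += int(getattr(result, stage))
    return {
        domain: {stage: correct[domain][stage] / totals[domain] for stage in STAGES}
        for domain in totals
    }


# Invented toy outcomes, used only to exercise the function.
toy_results = [
    TrialResult("color", what=True, how=True, apply=True),
    TrialResult("color", what=True, how=True, apply=False),
    TrialResult("number", what=True, how=False, apply=False),
    TrialResult("number", what=True, how=True, apply=False),
]

accuracy = stage_accuracy(toy_results)
for domain, per_stage in accuracy.items():
    print(domain, per_stage)
```

Keeping the three stages as separate fields makes it possible to see where performance drops off within a domain, which is the kind of comparison the abstract draws between the "what" stage and the later two stages.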

Problem

Research questions and friction points this paper is trying to address.

Evaluates visual analogical reasoning in large multimodal models.

Compares model performance to children and adults on visual transformations.

Highlights limitations of models trained on 2D images and text.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed a benchmark of 4,300 visual transformations of everyday objects.

Structured the evaluation into three reasoning stages.

Compared LMM performance with that of children and adults.
