BabyVision: Visual Reasoning Beyond Language

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 4
Influential: 2
🤖 AI Summary
This work addresses the overreliance of current multimodal large language models (MLLMs) on linguistic priors and their consequent weakness in foundational visual understanding, a capability that even human infants possess, which leads to markedly subpar performance on basic visual tasks. To evaluate pure visual reasoning independent of language, the authors introduce BabyVision, a benchmark of 388 non-linguistic visual tasks spanning four major categories and 22 subcategories. They further present BabyVision-Gen, which explores solving these tasks with generative models, along with an automated evaluation toolkit. Experiments show that leading models such as Gemini3-Pro-Preview (scoring 49.7) fall far short of adult human performance (94.1), exposing a critical gap in foundational visual primitives and motivating work toward more human-like visual perception in multimodal models.

📝 Abstract
While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncover a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities of MLLMs independent of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress on BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproducibility.
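Neither the summary nor the abstract details the evaluation toolkit's interface, but the scoring it describes reduces to checking model answers against ground truth and aggregating per subcategory. Below is a minimal sketch of such a loop, assuming a hypothetical JSONL item format (with `image`, `category`, `subcategory`, and `answer` fields) and a user-supplied `model_fn`; the actual toolkit in the BabyVision repository may differ.

```python
import json
from collections import defaultdict
from pathlib import Path

def evaluate(items_path: str, model_fn) -> dict:
    """Score model_fn on BabyVision-style items, grouped by subcategory.

    Hypothetical item format (one JSON object per line):
      {"image": "tasks/rotation/001.png", "category": "spatial",
       "subcategory": "mental_rotation", "answer": "B"}
    model_fn(image_path) must return the model's answer as a string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for line in Path(items_path).read_text().splitlines():
        item = json.loads(line)
        key = (item["category"], item["subcategory"])
        total[key] += 1
        pred = model_fn(item["image"]).strip().lower()
        if pred == item["answer"].strip().lower():
            correct[key] += 1
    # Report per-subcategory and overall accuracy on a 0-100 scale.
    scores = {k: 100.0 * correct[k] / total[k] for k in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return {"overall": overall, "per_subcategory": scores}
```

If the overall score is plain accuracy over the 388 items, Gemini3-Pro-Preview's 49.7 would correspond to roughly 193 correct answers; the paper may instead weight categories differently, so this is only an illustration of the scale of the reported numbers.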
Problem

Research questions and friction points this paper is trying to address.

visual reasoning
multimodal LLMs
visual understanding
language-independent vision
core visual skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

BabyVision
visual reasoning
multimodal LLMs
language-independent vision
visual benchmark