ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
State-of-the-art large multimodal models (LMMs) exhibit fundamental deficiencies in spatial cognition and visual understanding, while mainstream vision benchmarks lose discriminative power as model progress rapidly erodes their headroom. Method: the authors propose an "impossible benchmark" paradigm and introduce ZeroBench, a lightweight visual reasoning benchmark of 100 manually curated questions (plus 334 easier subquestions) grounded in human visual reasoning, with model failures studied through error attribution analysis. Contribution/Results: all 20 evaluated state-of-the-art LMMs score 0.0% on the main questions, which the authors present as the first benchmark that no current frontier model can solve at all. Its extreme difficulty makes it resistant to saturation and gives it lasting evaluative value, and the benchmark is publicly released to advance research in visual understanding.

📝 Abstract
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
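The headline 0.0% figure is plain exact-match accuracy over the 100 main questions. A minimal sketch of such scoring is below; the function names, normalisation rule, and toy data are illustrative assumptions, not the authors' released grading code:

```python
def exact_match(pred: str, gold: str) -> bool:
    """Hypothetical grading rule: compare answers after trimming and lowercasing."""
    return pred.strip().lower() == gold.strip().lower()


def benchmark_accuracy(responses: dict, answer_key: dict) -> float:
    """Percentage of main questions answered correctly; 0/N yields 0.0."""
    correct = sum(
        exact_match(responses.get(qid, ""), gold)
        for qid, gold in answer_key.items()
    )
    return 100.0 * correct / len(answer_key)


# Toy illustration with made-up items: every answer is wrong, so accuracy is 0.0%.
answer_key = {"q1": "7", "q2": "blue"}
responses = {"q1": "12", "q2": "red"}
print(benchmark_accuracy(responses, answer_key))  # 0.0
```

Under this reading, a benchmark "remains relevant" as long as the best model's score stays well below saturation; ZeroBench pushes that to the extreme by starting every model at zero.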
Problem

Research questions and friction points this paper is trying to address.

LMMs misinterpret images yet still score highly on popular visual benchmarks, so those benchmarks no longer expose their limitations.
Existing benchmarks saturate quickly; harder benchmarks with long-term relevance are needed.
How do frontier LMMs perform on a visual reasoning benchmark deliberately designed to be impossible for them?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ZeroBench, a 100-question visual reasoning benchmark (with 334 easier subquestions) on which current frontier LMMs score 0.0%
Evaluates 20 LMMs and rigorously analyses their errors
Publicly releases ZeroBench
👥 Authors
Jonathan Roberts — University of Cambridge
Mohammad Reza Taesiri — EA Sports (VLMs, LLMs, Post-Training, Evals, Computer Games)
Ansh Sharma — University of Cambridge
Akash Gupta — University of Cambridge
Samuel Roberts — Independent Researcher
Ioana Croitoru — Independent Researcher
Simion-Vlad Bogolin — Institute of Mathematics of the Romanian Academy (Computer Vision)
Jialu Tang — The University of Hong Kong
Florian Langer — University of Cambridge
Vyas Raina — University of Cambridge (Machine Learning, Deep Learning)
Vatsal Raina — University of Cambridge (Speech, off-topic detection, question-answering)
Hanyi Xiong — The University of Hong Kong
Vishaal Udandarao — PhD Student, University of Tübingen & University of Cambridge (Data-centric ML, Foundation Models, Vision and Language, Computer Vision)
Jingyi Lu — The University of Hong Kong
Shiyang Chen — The University of Hong Kong
Sam Purkis — Independent Researcher
Tianshuo Yan — The University of Hong Kong
Wenye Lin — Fraunhofer Institute for Solar Energy Systems (Phase change material slurries, Building energy efficiency, Sustainable buildings, Renewable and sustainable technologies in the)
Gyungin Shin — University of Oxford
Qiaochu Yang — The University of Hong Kong
Anh Totti Nguyen — Associate Professor, Auburn University (Machine Learning, Explainable AI, Computer Vision, NLP)
Kai Han — The University of Hong Kong
Samuel Albanie — Google DeepMind (AI Oversight, Machine Learning, Computer Vision, Natural Language Processing)