LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks are predominantly task-oriented, with inconsistent sample distributions across tasks, which hinders systematic analysis of the interplay between low-level perception and high-level reasoning. Method: We introduce LENS, a capability-evolution-oriented multimodal reasoning benchmark built on 3.4K real-world social media images (53% posted after January 2025) and 60K+ human-authored questions, structured as a three-tier progression: perception → understanding → reasoning. Its full-task annotation paradigm under image-invariant conditions enables cross-layer capability disentanglement and causal attribution. The benchmark combines a multi-granularity task design, manual annotation, and a cross-model consistency evaluation framework. Results: Across 15+ state-of-the-art MLLMs, including Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, no model exceeds 60% accuracy on the reasoning tier, exposing a fundamental bottleneck in current multimodal reasoning capabilities.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex, real-world scenarios remains limited. Existing benchmarks are usually constructed in a task-oriented manner, with no guarantee that samples for different tasks are drawn from the same data distribution; they therefore fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, organized into three progressive task tiers: perception, understanding, and reasoning. A key feature is that each image carries rich annotations for all tasks, so the dataset intrinsically supports evaluating MLLMs on image-invariant prompts, from basic perception to compositional reasoning. In addition, the images were manually collected from social media, and 53% were published after January 2025. We evaluate 15+ frontier MLLMs, such as Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, as well as two reasoning models, QVQ-72B-preview and Kimi-VL. All of these models were released after December 2024, and none achieves an accuracy greater than 60% on the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/
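
The image-invariant, full-task annotation design is easiest to see as a data layout: every image carries questions for all three tiers, so a model's tier-by-tier scores are directly comparable on identical visual input. Below is a minimal sketch of such a record and a per-tier scoring loop; the names (`Question`, `LensImage`, `tier_accuracy`) and the `predict` callback are illustrative assumptions, not Lens's actual schema or evaluation code.

```python
from dataclasses import dataclass, field

TIERS = ("perception", "understanding", "reasoning")

@dataclass
class Question:
    tier: str    # one of TIERS (hypothetical field layout)
    task: str    # one of the eight tasks
    prompt: str
    answer: str

@dataclass
class LensImage:
    image_path: str
    scenario: str    # one of the 12 daily scenarios
    questions: list[Question] = field(default_factory=list)  # all tiers, same image

def tier_accuracy(images, predict):
    """Per-tier accuracy for one model.

    `predict(image_path, prompt) -> str` is an assumed model-inference
    callback. Every question on a given image reuses the same image, so
    score differences across tiers are attributable to the prompts alone.
    """
    correct = {t: 0 for t in TIERS}
    total = {t: 0 for t in TIERS}
    for img in images:
        for q in img.questions:
            total[q.tier] += 1
            if predict(img.image_path, q.prompt).strip() == q.answer:
                correct[q.tier] += 1
    return {t: correct[t] / max(total[t], 1) for t in TIERS}
```

Because `tier_accuracy` scores the same image set at every tier, a drop from perception to reasoning reflects a capability gap rather than a distribution shift between task samples, which is the evaluation property the benchmark is built around.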
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' reasoning in complex real-world scenarios
Assessing synergistic effects of perception on higher-order reasoning
Benchmarking MLLMs' performance from perception to compositional reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level benchmark with 3.4K images
60K+ human-authored questions across tasks
Evaluates MLLMs from perception to reasoning
Authors
Ruilin Yao
Wuhan University of Technology
Bo Zhang
Wuhan University of Technology
Jirui Huang
Wuhan University of Technology
Xinwei Long
Tsinghua University
natural language processing, multi-modal learning
Yifang Zhang
Wuhan University of Technology
Tianyu Zou
Wuhan University of Technology
Yufei Wu
Wuhan University of Technology
Shichao Su
Wuhan University of Technology
Yifan Xu
Wuhan University of Technology
Wenxi Zeng
Wuhan University of Technology
Zhaoyu Yang
Wuhan University of Technology
Guoyou Li
Wuhan University of Technology
Shilan Zhang
Wuhan University of Technology
Zichan Li
Wuhan University of Technology
Yaxiong Chen
Wuhan University of Technology
deep hashing, deep learning
Shengwu Xiong
Wuhan University of Technology
Artificial Intelligence
Peng Xu
Tsinghua University
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Multimodal Information Processing
Bowen Zhou
Shanghai AI Lab
David Clifton
University of Oxford
Luc Van Gool
Professor of computer vision at INSAIT, Sofia University; em. KU Leuven; em. ETHZ; Toyota Lab TRACE
computer vision, machine learning, AI, autonomous cars, cultural heritage