AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing evaluation methods that rely on static difficulty labels and simplistic metrics, which fail to capture the adaptive reasoning capabilities of vision-language models in dynamically selecting between tool-augmented visual reasoning and pure textual reasoning. To this end, we propose AdaptMMBench, a multimodal benchmark spanning five domains—real-world scenarios, OCR, GUI, knowledge, and mathematics—that introduces a dynamic difficulty identification mechanism grounded in model capability boundaries. Our framework employs the Matthews Correlation Coefficient (MCC) to assess the appropriateness of reasoning mode selection and enables multidimensional process analysis, including coverage of critical reasoning steps, tool effectiveness, and computational efficiency. Experiments reveal that adaptive capability improves with model scale yet remains decoupled from final accuracy, that critical step coverage positively correlates with performance, and that tool effectiveness varies significantly across model architectures.

📝 Abstract
Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench uses a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models' capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
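To make the MCC-based mode-selection score concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the benchmark's implementation: the abstract does not say how capability boundaries are measured, so this sketch assumes a sample "needs tools" for a given model exactly when that model answers it incorrectly in text-only mode. The function names `needs_tool_labels` and `mode_selection_mcc` are hypothetical. The score itself is the standard binary MCC, (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
import math
from typing import List


def needs_tool_labels(text_only_correct: List[bool]) -> List[bool]:
    """Derive per-model difficulty labels.

    ASSUMPTION (not from the paper): a sample requires tool-augmented
    reasoning for this model iff the model fails it in text-only mode.
    """
    return [not ok for ok in text_only_correct]


def mode_selection_mcc(needs_tool: List[bool], used_tool: List[bool]) -> float:
    """Standard binary Matthews Correlation Coefficient between the
    model-specific 'needs tool' labels and the mode the model chose."""
    if len(needs_tool) != len(used_tool):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(needs_tool, used_tool))
    tp = sum(1 for n, u in pairs if n and u)          # needed tools, used tools
    tn = sum(1 for n, u in pairs if not n and not u)  # text sufficed, stayed in text
    fp = sum(1 for n, u in pairs if not n and u)      # unnecessary tool call
    fn = sum(1 for n, u in pairs if n and not u)      # skipped a needed tool call

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal is empty (MCC undefined).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical toy data: four samples for one model.
    text_only_correct = [False, False, True, True]
    used_tool = [True, False, False, False]
    labels = needs_tool_labels(text_only_correct)
    print(round(mode_selection_mcc(labels, used_tool), 3))
```

Because the labels are derived per model, the same sample can count as "easy" for a strong model and "hard" for a weak one. This is one way a mode-selection score can stay separate from final accuracy, in line with the decoupling the abstract reports.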
Problem

Research questions and friction points this paper is trying to address.

adaptive multimodal reasoning
mode selection
dynamic difficulty
process evaluation
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive multimodal reasoning
mode selection
dynamic difficulty assessment
Matthews Correlation Coefficient (MCC)
process evaluation
Xintong Zhang
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; State Key Laboratory of General Artificial Intelligence, BIGAI
Xiaowen Zhang
State Key Laboratory of General Artificial Intelligence, BIGAI; Xidian University
Jongrong Wu
State Key Laboratory of General Artificial Intelligence, BIGAI
Zhi Gao
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
Shilin Yan
Fudan University
MLLMs, Computer Vision, Multi-Modal
Zhenxin Diao
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Kunpeng Gao
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Xuanyan Chen
Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Robotics, Trajectory Optimization, Task and Motion Planning
Yunde Jia
Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
Qing Li
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Machine Learning, Large Language Model, Ordinary/Partial Differential Equation