Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

📅 2025-10-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Modern multimodal large language models (MLLMs) support dynamic image resolution inputs, yet existing evaluations focus on semantic performance and neglect resolution robustness, that is, the stability of model performance across varying input resolutions. Method: We introduce Res-Bench, a benchmark dedicated to resolution robustness, comprising 14,400 samples, 12 resolution levels, and six core capability dimensions. The evaluation framework quantifies performance fluctuations using Spearman's rank correlation coefficient and Absolute/Relative Continuous Error (ACE/RCE), and compares padding against super-resolution as preprocessing strategies. We further explore whether fine-tuning can enhance stability. Contribution/Results: A large-scale evaluation of leading MLLMs examines robustness from both model-centric and task-centric perspectives, providing an empirical basis and a methodological framework for measuring and improving resolution robustness in MLLMs.

📝 Abstract

Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness: whether performance remains stable across varying input resolutions. To address this gap, we introduce Res-Bench, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.
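To make the two kinds of robustness metric concrete, the following is a minimal sketch rather than the paper's implementation. Spearman's correlation is computed in its standard form (Pearson correlation of average ranks). The abstract does not give formulas for ACE and RCE, so the two volatility functions below, `adjacent_abs_change` and `adjacent_rel_change`, are hypothetical stand-ins that sum the absolute and relative accuracy changes between neighbouring resolution levels. The resolution values and accuracies in the usage note are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / (var_x * var_y) ** 0.5


def adjacent_abs_change(accuracies):
    """ASSUMED stand-in for ACE: summed |change| between adjacent levels."""
    return sum(abs(b - a) for a, b in zip(accuracies, accuracies[1:]))


def adjacent_rel_change(accuracies):
    """ASSUMED stand-in for RCE: each change divided by the earlier accuracy."""
    return sum(abs(b - a) / a
               for a, b in zip(accuracies, accuracies[1:]) if a != 0)
```

For example, with hypothetical resolutions `[224, 448, 672, 896]` and accuracies `[0.50, 0.60, 0.55, 0.70]`, `spearman` reports a positive but imperfect trend, while the two volatility functions grow with every up-and-down swing between neighbouring resolutions, even when the overall trend is upward.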

Problem

Research questions and friction points this paper is trying to address.

Benchmarking MLLM robustness to dynamic image resolutions

Evaluating performance stability across varying input resolutions

Assessing resolution robustness beyond traditional accuracy metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking MLLM robustness to dynamic resolution inputs

Introducing multiple metrics for performance stability evaluation

Evaluating preprocessing strategies and fine-tuning for stability enhancement
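As an illustration of the padding strategy named above, here is a minimal sketch of centring a small image on a larger constant-filled canvas. The summary does not describe how the benchmark pads images, so the centring, the fill value, and the function name `pad_to_canvas` are all assumptions; the image is represented as a plain list of pixel rows to keep the example dependency-free.

```python
def pad_to_canvas(image, canvas_h, canvas_w, fill=0):
    """Center a 2D image (list of equal-length rows) on a filled canvas.

    ASSUMPTION: centred placement with a constant fill value; the source
    does not specify the padding scheme used in the benchmark.
    """
    h = len(image)
    w = len(image[0]) if h else 0
    if h > canvas_h or w > canvas_w:
        raise ValueError("image is larger than the canvas")
    top = (canvas_h - h) // 2
    left = (canvas_w - w) // 2
    canvas = [[fill] * canvas_w for _ in range(canvas_h)]
    for r, row in enumerate(image):
        # Overwrite the centred window of this canvas row with image pixels.
        canvas[top + r][left:left + w] = row
    return canvas
```

Padding leaves the original pixels untouched and only changes the canvas size the model sees, whereas super-resolution synthesizes new pixels; that difference is presumably why the two strategies are compared.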

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs