AI Summary
This work addresses the challenge that multimodal large language models (MLLMs) exhibit significant heterogeneity in architecture and efficiency, making it difficult for any single model to simultaneously optimize both cost and accuracy. To this end, the authors propose MMR-Bench, the first standardized, budget-aware evaluation framework for multimodal routing. MMR-Bench supports dynamic routing strategy assessment under a fixed set of candidate models by incorporating modality-aware inputs, variable computational budgets, and diverse vision-language tasks, including OCR, visual question answering, and multimodal mathematical reasoning. The framework leverages modality-fusion signals to enhance routing quality and achieves zero-shot generalization across tasks and modalities. Experiments demonstrate that the proposed routing strategy surpasses the accuracy of the strongest individual model at approximately 33% of its computational cost and generalizes effectively to new datasets and purely textual tasks without any fine-tuning.
Abstract
Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads span lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over-provisions compute on easy instances or sacrifices accuracy on hard ones. Query-level model selection (routing) addresses this tension, but extending routing from text-only LLMs to MLLMs is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation. We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR-Bench provides (i) a controlled environment with modality-aware inputs and variable compute budgets, (ii) a broad suite of vision-language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single-model references, oracle upper bounds, and representative routing policies. Using MMR-Bench, we show that incorporating multimodal signals improves routing quality. Empirically, these cues improve the cost-accuracy frontier and enable the routed system to exceed the strongest single model's accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero-shot to new datasets and text-only benchmarks without retuning, establishing MMR-Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: https://github.com/Hunter-Wrynn/MMR-Bench.
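To make the routing problem concrete, the query-level selection under a compute budget described above can be sketched as a minimal selection rule: pick the candidate with the best predicted quality whose cost fits the budget. This is an illustrative toy, not the paper's actual policy; the model names, costs, and predicted scores are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost: float        # relative compute cost per query (hypothetical units)
    pred_score: float  # router's predicted answer quality for this query

def route(candidates: list[Candidate], budget: float) -> Candidate:
    """Pick the highest-predicted-quality model that fits the budget;
    fall back to the single cheapest model if nothing fits."""
    affordable = [c for c in candidates if c.cost <= budget]
    pool = affordable or [min(candidates, key=lambda c: c.cost)]
    return max(pool, key=lambda c: c.pred_score)

# Toy candidate set for one query (all values illustrative).
models = [
    Candidate("small-ocr", cost=1.0, pred_score=0.62),
    Candidate("mid-vqa", cost=3.0, pred_score=0.78),
    Candidate("large-reasoner", cost=9.0, pred_score=0.81),
]
print(route(models, budget=3.0).name)  # prints "mid-vqa": best model affordable at this budget
```

Under a tight budget the router trades a small predicted-quality drop (0.81 to 0.78) for a 3x cost saving, which is the kind of cost-accuracy frontier MMR-Bench is built to measure.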