LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

📅 2026-05-11

🤖 AI Summary
This work addresses how to select the best multimodal large language model (MLLM) for an image-text query before any candidate model has produced an output, balancing performance, cost, and latency. The authors propose LatentRouter, which formulates routing as a counterfactual multimodal utility prediction problem. It performs latent communication between learned routing capsules extracted from the query and per-model capability tokens to estimate how each candidate model would perform if selected. A utility-based routing policy supports both performance-oriented and performance-cost routing and handles changing candidate pools through availability masking, while a bounded capsule correction refines close decisions without letting residual signals dominate the prediction. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines, with the strongest gains on task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements.
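As a rough illustration of the performance-cost trade-off and variable candidate pools mentioned in the summary, the sketch below picks the available model with the highest utility. It is not taken from the paper or its repository: the linear form `quality - lam * cost`, the function name `route`, the model names, and all numbers are assumptions made for this example.

```python
import math

# Hypothetical predicted quality and relative cost per candidate model.
# These values are illustrative only, not results from the paper.
pred_quality = {"model_a": 0.82, "model_b": 0.78, "model_c": 0.90}
cost = {"model_a": 1.0, "model_b": 0.2, "model_c": 3.0}


def route(pred_quality, cost, available, lam=0.0):
    """Return (name, utility) of the best available model.

    Assumed utility: predicted quality minus lam times cost.
    lam = 0 gives performance-oriented routing; lam > 0 penalizes cost.
    """
    best_name, best_utility = None, -math.inf
    for name, quality in pred_quality.items():
        if not available.get(name, False):
            continue  # availability masking: skip models not in the pool
        utility = quality - lam * cost[name]
        if utility > best_utility:
            best_name, best_utility = name, utility
    if best_name is None:
        raise ValueError("no candidate model is available")
    return best_name, best_utility


all_available = {name: True for name in pred_quality}
without_c = {"model_a": True, "model_b": True, "model_c": False}

print(route(pred_quality, cost, all_available, lam=0.0)[0])  # model_c
print(route(pred_quality, cost, all_available, lam=0.1)[0])  # model_b
print(route(pred_quality, cost, without_c, lam=0.0)[0])      # model_a
```

Because every model is scored independently by the same rule, removing a model from the pool only masks its entry and leaves the other scores unchanged.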
📝 Abstract
Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
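To make the idea of a distributional outcome head with a bounded correction more concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the functional form, so everything here is hypothetical: the quality bins, the softmax expectation, the `tanh` bound, the `delta` parameter, and all function names are choices made for this example, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_quality(bin_logits, bin_values):
    """Assumed distributional head: expectation over discrete quality bins."""
    probs = softmax(bin_logits)
    return sum(p * v for p, v in zip(probs, bin_values))


def bounded_correction(raw_residual, delta=0.05):
    """Assumed bound: squash the residual into [-delta, +delta] with tanh."""
    return delta * math.tanh(raw_residual)


def corrected_score(bin_logits, bin_values, raw_residual, delta=0.05):
    """Base expected quality plus a correction that cannot exceed delta."""
    base = expected_quality(bin_logits, bin_values)
    return base + bounded_correction(raw_residual, delta)


bins = [0.0, 1.0]  # hypothetical quality bins: wrong / correct

# Uniform logits give a base score of 0.5; even a huge residual
# shifts the final score by at most delta.
print(expected_quality([0.0, 0.0], bins))
print(corrected_score([0.0, 0.0], bins, raw_residual=100.0))
```

With a bound of this kind, the correction can flip the ranking of two models whose base scores differ by less than `2 * delta`, but it cannot override a larger gap, which is one way to read "refines close decisions without allowing residual signals to dominate the prediction".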
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
model routing
utility prediction
counterfactual reasoning
capability matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

LatentRouter
multimodal routing
counterfactual utility prediction
latent communication
model capability token