Adaptive Vision-Language Model Routing for Computer Use Agents

📅 2026-03-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that existing GUI agents typically rely on a single fixed vision-language model (VLM), struggling to balance cost and accuracy. To overcome this limitation, the authors propose an Adaptive VLM Routing framework (AVR) that employs a lightweight semantic routing layer to dynamically assess task difficulty from multimodal embeddings and the confidence of a small model. Guided by a reliability threshold, AVR selects the lowest-cost VLM that meets the required accuracy. The approach integrates action-difficulty awareness and cost–accuracy trade-offs into the routing mechanism, leverages retrieval-augmented context to narrow the performance gap between small and large models, and incorporates a Visual Confused Deputy safety guardrail. Evaluated on the ScreenSpot-Pro and OpenClaw benchmarks, AVR reduces inference costs by up to 78% with no more than a 2-percentage-point drop in accuracy.
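As a rough illustration of the threshold-guided selection described above, the sketch below picks the cheapest model whose predicted accuracy clears a reliability threshold and escalates otherwise. The accuracy predictors, model names, and numbers are invented placeholders, not the paper's learned router:

```python
# Hypothetical sketch of AVR-style threshold routing. All predictors and
# constants below are illustrative stand-ins, not the paper's method.

def predict_acc_small(difficulty, confidence):
    # Toy model: the small VLM succeeds when the action is easy
    # or when its own probe confidence is high.
    return 0.95 - 0.5 * difficulty + 0.3 * confidence

def predict_acc_large(difficulty, confidence):
    # Toy model: the large VLM is assumed uniformly strong.
    return 0.98 - 0.1 * difficulty

MODELS = [  # (name, relative cost, accuracy predictor), sorted by cost
    ("small-vlm", 1.0, predict_acc_small),
    ("large-vlm", 10.0, predict_acc_large),
]

def route(difficulty, confidence, models=MODELS, tau=0.9):
    """Return the cheapest model whose predicted accuracy meets tau."""
    for name, _cost, predict in models:
        if predict(difficulty, confidence) >= tau:
            return name
    # No model clears the threshold: escalate to the strongest one.
    return models[-1][0]
```

For example, an easy action with a confident small-model probe (`route(0.1, 0.8)`) stays on the cheap model, while a hard, low-confidence action (`route(0.9, 0.0)`) escalates.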

๐Ÿ“ Abstract
Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost–accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.
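Under one hedged reading, the threshold-based policy the abstract formalizes can be written as follows, assuming a per-model inference cost $c_m$ and a predicted per-action accuracy $\hat{a}_m(x)$ for action $x$; the symbols here are ours, not necessarily the paper's notation:

```latex
% Among the model pool \mathcal{M}, pick the cheapest model whose
% predicted accuracy clears the reliability threshold \tau.
m^*(x) = \arg\min_{m \in \mathcal{M}} \; c_m
\quad \text{subject to} \quad \hat{a}_m(x) \ge \tau
```

When no model satisfies the constraint, a natural fallback consistent with the abstract is to escalate to the strongest available model.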
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Model
Model Routing
Computer Use Agents
Grounding Accuracy
Task Difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive VLM Routing
Vision-Language Model
Computer Use Agents
Cost-Accuracy Trade-off
Semantic Routing
🔎 Similar Papers
No similar papers found.
Xunzhuo Liu
vLLM Semantic Router Project
Bowei He
City University of Hong Kong, MBZUAI
Data Mining, Language Model, GenAI4Science, Agentic AI
Xue Liu
MBZUAI, McGill University
Andy Luo
Unknown affiliation
Haichen Zhang
AMD
Huamin Chen
Red Hat