SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an agent-level speculative acceleration framework that addresses the high latency and low concurrency of multimodal large language model agents caused by sequential perception–reasoning–tool-invocation loops. The framework introduces a cognitive gating mechanism grounded in answer separability, enabling self-verified speculative planning: a lightweight, tool-free model predicts the execution trajectory and preemptively terminates redundant tool chains. A heterogeneous parallel funnel architecture then combines the stateless, high-concurrency inference of the small model with the stateful, sequential reasoning of the large model. Experiments show a 1.1–3.35× speedup on V* Bench, HR-Bench, and POPE, with accuracy improvements of up to 6.7%, substantially enhancing serving throughput.

📝 Abstract
Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
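The abstract's core idea — a small tool-free model speculates an answer, and a confidence gate based on answer separability decides whether to accept it or fall back to the expensive agentic tool loop — can be sketched as follows. All names, the margin-based separability proxy, and the threshold `tau` are our illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of SpecEyes-style speculative planning with cognitive
# gating. The separability metric and threshold are illustrative assumptions.

def answer_separability(probs):
    """Margin between the top two answer probabilities, used here as a
    proxy for the paper's answer-separability confidence signal."""
    top = sorted(probs.values(), reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)

def speculative_answer(query, small_model, big_agent, tau=0.4):
    """Try the lightweight tool-free model first; invoke the full agentic
    tool chain only when the cognitive gate rejects the speculation."""
    probs = small_model(query)               # maps candidate answer -> probability
    if answer_separability(probs) >= tau:    # gate accepts: terminate tool chain early
        return max(probs, key=probs.get)
    return big_agent(query)                  # gate rejects: run full agentic loop
```

The key property, as described in the abstract, is that the gate is self-verified: it needs only the small model's own output distribution, not oracle labels.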
Problem

Research questions and friction points this paper is trying to address.

agentic multimodal LLMs
sequential overhead
agentic depth
latency
system concurrency
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative planning
agentic multimodal LLMs
cognitive gating
heterogeneous parallel funnel
tool invocation acceleration
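The heterogeneous parallel funnel from the abstract — stateless small-model speculation fanned out concurrently, with rejected queries funneled into the stateful, serial large model — might look roughly like this. The structure, worker count, and function names are our assumptions for illustration, not the paper's code.

```python
# Illustrative "heterogeneous parallel funnel": the stateless small model
# handles many queries concurrently; only gate-rejected queries reach the
# serial large-model agent. Names and structure are assumptions.
from concurrent.futures import ThreadPoolExecutor

def funnel(queries, small_model, big_agent, gate):
    answers, fallback = {}, []
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Stateless, high-concurrency speculative stage.
        for q, probs in zip(queries, pool.map(small_model, queries)):
            if gate(probs):
                answers[q] = max(probs, key=probs.get)
            else:
                fallback.append(q)
    # Stateful large model runs serially; its latency is partly masked
    # because most queries were already resolved speculatively above.
    for q in fallback:
        answers[q] = big_agent(q)
    return answers
```

The throughput gain comes from the funnel shape: the cheap concurrent stage absorbs the bulk of the traffic, so the expensive serial stage sees only the hard residue.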