Rethinking Model Efficiency: Multi-Agent Inference with Large Models

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the high end-to-end latency in vision-language models caused by autoregressive generation of long output sequences, a challenge exacerbated by the fact that smaller models often require longer sequences to approach the performance of larger counterparts. To overcome this, the authors propose a multi-agent collaborative reasoning framework that enables cross-model transfer of critical reasoning tokens. The approach leverages the large model's advantage in producing concise, high-quality responses while selectively reusing the cheaper inference results of smaller models. The method achieves, for the first time, effective token-level knowledge transfer across heterogeneous models, substantially reducing latency. Empirical evaluations on multiple real-world benchmarks demonstrate that the proposed framework attains performance comparable to a large model operating independently, with significantly better computational efficiency.
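The summary's central latency claim can be illustrated with a back-of-envelope model of autoregressive decoding: end-to-end latency is roughly prefill time plus output length times per-token decode time. The numbers below are purely hypothetical (not taken from the paper) and only show how a slower-per-token large model can still finish first when it needs far fewer output tokens.

```python
# Back-of-envelope autoregressive latency model (illustrative numbers only,
# not from the paper): latency ≈ prefill + num_output_tokens * per_token_time.
def e2e_latency_ms(prefill_ms: float, per_token_ms: float, num_tokens: int) -> float:
    return prefill_ms + per_token_ms * num_tokens

# Hypothetical setting: the large model decodes each token more slowly,
# but needs far fewer output tokens for comparable answer quality.
large = e2e_latency_ms(prefill_ms=80.0, per_token_ms=30.0, num_tokens=50)   # 1580.0 ms
small = e2e_latency_ms(prefill_ms=40.0, per_token_ms=10.0, num_tokens=400)  # 4040.0 ms
assert large < small  # the large model wins despite higher per-token cost
```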
πŸ“ Abstract
Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. The number of output tokens can therefore be the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. An empirical study on diverse real-world benchmarks confirms this observation: a large model can achieve performance better than or comparable to a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps the large model's responses short but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that reusing the reasoning tokens from small models helps the framework approach the performance of a large model performing its own full reasoning, which confirms the effectiveness of our proposal.
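The abstract's framework can be sketched as a simple control flow: the large model first attempts a short response, and only when that attempt is uncertain are the small model's cheaper reasoning tokens pulled in as extra context. All names and the confidence-threshold heuristic below are hypothetical illustrations; the paper's actual mechanism transfers model tokens, not prompt strings.

```python
from typing import Callable, Tuple

# Minimal control-flow sketch of the multi-agent idea in the abstract.
# large_model returns (answer, confidence); small_model returns a cheap
# chain of reasoning tokens. Both signatures are assumptions for this sketch.
def multi_agent_answer(
    question: str,
    large_model: Callable[[str], Tuple[str, float]],
    small_model: Callable[[str], str],
    threshold: float = 0.8,
) -> str:
    # 1. The large model first tries a short, direct response.
    answer, confidence = large_model(question)
    if confidence >= threshold:
        return answer
    # 2. If the short answer is uncertain, reuse the small model's
    #    reasoning tokens as additional context for the large model.
    reasoning = small_model(question)
    answer, _ = large_model(f"{question}\nReasoning hints: {reasoning}")
    return answer

# Toy stubs standing in for real models, to show the routing behavior.
def toy_large(prompt: str) -> Tuple[str, float]:
    if "Reasoning hints" in prompt:
        return ("4", 0.95)  # confident once hints are provided
    return ("unsure", 0.5)  # short first attempt is uncertain

def toy_small(prompt: str) -> str:
    return "2 + 2 = 4"

print(multi_agent_answer("What is 2+2?", toy_large, toy_small))  # → 4
```

The threshold trades latency for quality: a higher value routes more queries through the small model's reasoning, while a lower value keeps more responses short.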
Problem

Research questions and friction points this paper is trying to address.

model efficiency
vision-language models
output tokens
inference latency
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent inference
model efficiency
vision-language models
output token reduction
large language models
πŸ”Ž Similar Papers
No similar papers found.