AI Summary
This work addresses the challenge of coordinating self-interested agents in open, distributed multi-agent systems, where the absence of centralized control hinders the simultaneous optimization of global efficiency and long-term reuse of shared resources such as KV caches. To this end, the authors propose the IEMAS framework, which uniquely integrates KV cache affinity into incentive mechanism design. By combining probabilistic quality-of-service prediction with a VCG-based bipartite matching algorithm, IEMAS achieves incentive compatibility and social optimality under many-to-many long-term matching. Experiments on a vLLM-based implementation demonstrate that the proposed approach reduces average service cost by 35% and end-to-end latency by up to 2.9× compared to baseline methods.
Abstract
The transition to open, distributed Multi-Agent Systems (MAS) promises scalable intelligence but introduces a non-trivial tension: maximizing global efficiency requires cooperative, resource-aware scheduling, yet autonomous agents may be self-interested and cannot be managed by a centralized controller. Prior approaches fall short in two key areas: they typically focus on single-query routing, neglecting long-term resource reuse (e.g., KV caching) and the complexities of system-level many-to-many matching; furthermore, they rely on generic incentive mechanisms that ignore the distinct characteristics of LLM inference. To bridge this gap, we propose IEMAS (Incentive-Efficiency Mechanism for Multi-Agent Systems), a distributed framework that aligns economic incentives with system performance. IEMAS integrates a probabilistic predictive model to estimate Quality of Service (QoS) under uncertainty, which feeds into a VCG-based bipartite matching mechanism. This design guarantees truthful capability reporting and social optimality while explicitly leveraging KV cache affinity to minimize computational redundancy. We implement IEMAS on top of vLLM and evaluate it via extensive simulations. Results demonstrate that our incentive-efficiency co-design reduces average service cost by 35% and end-to-end latency by up to 2.9× compared to baselines.
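To make the VCG-based matching concrete, the following is a minimal, self-contained sketch of how Clarke-pivot payments are computed for a cost-minimizing assignment of queries to agents. It is an illustration only, not the IEMAS implementation: the agent names, query names, the toy cost matrix, and the brute-force matching routine are all hypothetical, and a real system would use reported costs adjusted for KV-cache affinity plus an efficient bipartite-matching solver.

```python
from itertools import permutations

def min_cost_assignment(cost, agents, tasks):
    """Brute-force minimum-cost matching of tasks to distinct agents.

    cost[agent][task] is that agent's (reported) cost of serving the task.
    Returns (assignment dict task->agent, total cost). Suitable only for
    tiny instances; a real system would use a bipartite-matching algorithm.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(agents, len(tasks)):
        c = sum(cost[a][t] for a, t in zip(perm, tasks))
        if c < best_cost:
            best, best_cost = dict(zip(tasks, perm)), c
    return best, best_cost

def vcg_payments(cost, agents, tasks):
    """VCG (Clarke pivot) payments for a procurement-style assignment.

    Each winning agent is paid: (optimal total cost if it were absent)
    minus (cost borne by the other winners in the chosen assignment).
    This makes truthful cost reporting a dominant strategy.
    """
    match, total = min_cost_assignment(cost, agents, tasks)
    payments = {}
    for task, agent in match.items():
        others = [a for a in agents if a != agent]
        _, total_without = min_cost_assignment(cost, others, tasks)
        cost_of_others = total - cost[agent][task]
        payments[agent] = total_without - cost_of_others
    return match, payments

# Hypothetical example: three agents bidding to serve two queries.
cost = {
    "A": {"q1": 3, "q2": 5},
    "B": {"q1": 4, "q2": 2},
    "C": {"q1": 6, "q2": 4},
}
match, payments = vcg_payments(cost, ["A", "B", "C"], ["q1", "q2"])
print(match)     # {'q1': 'A', 'q2': 'B'} -- socially optimal assignment
print(payments)  # {'A': 6, 'B': 4} -- each payment covers the agent's cost
```

Note that each winner's payment (A: 6, B: 4) weakly exceeds its reported cost (3 and 2), which is what makes truthful reporting individually rational under VCG.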