Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently selecting and connecting agents in large language model–based multi-agent systems so that they can collaboratively solve complex tasks. It introduces, for the first time, learnable decentralized communication-topology optimization into this domain, formulating topology construction as a cooperative multi-agent reinforcement learning problem under the QMIX value-decomposition framework with centralized training and decentralized execution (CTDE). A topology-aware graph neural network encoder, a GRU-based memory module, and a dedicated communication Q-head are combined to dynamically generate communication graphs. Guided by a reward mechanism that balances task accuracy against token cost, the proposed method achieves state-of-the-art average performance across seven coding, reasoning, and mathematical benchmarks, reaching 20.8% accuracy on the HLE tasks and outperforming existing frameworks, thereby enhancing collaborative efficiency and robustness.
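The summary's reference to QMIX value decomposition can be illustrated with a minimal sketch. This is not the paper's implementation; the dimensions, random hypernetwork weights, and use of ReLU (standard QMIX uses ELU) are assumptions for illustration. The key idea shown is that non-negative mixing weights make the joint value Q_tot monotone in each agent's Q-value, so decentralized greedy action selection stays consistent with centralized training:

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, STATE_DIM, HIDDEN = 3, 4, 8  # illustrative sizes, not from the paper

# Hypernetwork parameters (random here; learned from the global state in practice).
w1 = rng.normal(size=(STATE_DIM, N_AGENTS * HIDDEN))
b1 = rng.normal(size=(STATE_DIM, HIDDEN))
w2 = rng.normal(size=(STATE_DIM, HIDDEN))
b2 = rng.normal(size=(STATE_DIM, 1))

def q_tot(agent_qs: np.ndarray, state: np.ndarray) -> float:
    """QMIX-style monotonic mixer: combines per-agent Q-values into Q_tot.

    Taking the absolute value of the hypernetwork outputs keeps the mixing
    weights non-negative, which enforces dQ_tot/dQ_i >= 0: each agent's
    greedy action is then also greedy for the team, the property that lets
    CTDE training coexist with decentralized execution.
    """
    W1 = np.abs(state @ w1).reshape(N_AGENTS, HIDDEN)     # non-negative weights
    W2 = np.abs(state @ w2).reshape(HIDDEN, 1)            # non-negative weights
    hidden = np.maximum(agent_qs @ W1 + state @ b1, 0.0)  # ReLU keeps monotonicity
    return float(hidden @ W2 + state @ b2)

state = rng.normal(size=STATE_DIM)          # global state (training-time only)
qs = np.array([0.2, -0.1, 0.5])             # per-agent communication Q-values
```

Raising any single agent's Q-value can only raise (never lower) `q_tot`, which is the monotonicity constraint QMIX factorization relies on.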
📝 Abstract
Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose **Agent Q-Mix**, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
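The accuracy-versus-token-cost trade-off in the reward could take a form like the sketch below. The linear penalty and the `token_penalty` coefficient are hypothetical; the abstract states only that the reward balances the two terms, not its exact shape:

```python
def topology_reward(task_correct: bool, tokens_used: int,
                    token_penalty: float = 1e-4) -> float:
    """Illustrative reward trading off task success against token cost.

    task_correct:  whether the multi-agent system solved the task.
    tokens_used:   total tokens spent across all agent messages this round.
    token_penalty: hypothetical weight on communication cost (not from the paper).
    """
    return float(task_correct) - token_penalty * tokens_used
```

Under this shaping, a topology that solves the task with fewer exchanged tokens earns a strictly higher reward, which is what pushes the learned communication graph toward sparsity.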
Problem

Research questions and friction points this paper is trying to address.

Multi-Agent Systems
Agent Coordination
Topology Selection
LLM Collaboration
Communication Graph
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent Q-Mix
Multi-Agent Reinforcement Learning
QMIX
Topology Selection
Decentralized Execution