UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

📅 2026-05-26

🤖 AI Summary

Existing LLM-based multi-agent systems rely heavily on handcrafted prompts, tools, and control rules, while existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. This work presents UnityMAS-O, a multi-agent RL optimization framework that treats the complete workflow as the optimization unit and introduces four first-class abstractions: logical agent roles, graph trajectories, user-defined rewards, and agent-to-model mappings. The design decouples logical agents from physical model parameters, so roles can fully share, partially share, or fully separate their parameters, and rewards can be assigned at the role, turn, and trajectory levels. The system extends verl with a Ray-based star-topology runtime: a central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards, while model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Experiments on retrieval-augmented QA, iterative agentic search, and reflective code generation show that RL optimization improves manually specified workflows, with especially large gains for smaller models and on the strict all-passed code metric.
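To make the agent-to-model mapping concrete, here is a minimal Python sketch of how decoupling logical roles from physical models yields the three sharing configurations. It is an illustration only, not UnityMAS-O's actual API: the function names, the role names (`planner`, `searcher`, `answerer`), and the model identifiers (`m0`, `m1`, `m2`) are all hypothetical.

```python
from collections import defaultdict


def group_roles_by_model(agent_to_model):
    """Invert a role -> model mapping into model -> sorted list of roles.

    Hypothetical helper: each group is the set of logical roles whose
    experience would update the same physical model parameters.
    """
    groups = defaultdict(list)
    for role, model in agent_to_model.items():
        groups[model].append(role)
    return {model: sorted(roles) for model, roles in groups.items()}


def sharing_mode(agent_to_model):
    """Classify a mapping as full sharing, full separation, or partial sharing."""
    n_roles = len(agent_to_model)
    n_models = len(set(agent_to_model.values()))
    if n_models == 1:
        return "full sharing"      # every role maps to one model
    if n_models == n_roles:
        return "full separation"   # every role has its own model
    return "partial sharing"       # some roles share, others do not


# Hypothetical three-role workflow under each configuration.
shared = {"planner": "m0", "searcher": "m0", "answerer": "m0"}
separate = {"planner": "m0", "searcher": "m1", "answerer": "m2"}
partial = {"planner": "m0", "searcher": "m0", "answerer": "m1"}
```

In this sketch, switching between configurations only changes the dictionary, which mirrors the summary's point that the workflow definition stays independent of how parameters are shared.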

📝 Abstract

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL optimization improves manually specified workflows, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
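As a rough illustration of rewards assigned at role, turn, and trajectory levels, the sketch below combines the three levels into one scalar per recorded step. The additive combination, the `Step` record, and the `assemble_rewards` function are assumptions made for this example; the abstract does not specify how the framework merges the levels, and the sketch flattens a graph trajectory into a simple list of steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded agent action (hypothetical, simplified record)."""
    role: str   # logical agent role that produced the step
    turn: int   # index of the interaction turn


def assemble_rewards(steps, trajectory_reward, role_rewards=None, turn_rewards=None):
    """Return one reward per step.

    Assumption: the three levels are summed. The trajectory-level reward is
    broadcast to every step; role- and turn-level terms default to 0.0.
    """
    role_rewards = role_rewards or {}
    turn_rewards = turn_rewards or {}
    return [
        trajectory_reward
        + role_rewards.get(step.role, 0.0)
        + turn_rewards.get(step.turn, 0.0)
        for step in steps
    ]


# Hypothetical two-step search-then-answer workflow.
steps = [Step(role="searcher", turn=0), Step(role="answerer", turn=1)]
rewards = assemble_rewards(
    steps,
    trajectory_reward=1.0,            # e.g. final answer judged correct
    role_rewards={"searcher": 0.5},   # e.g. bonus for a useful retrieval
    turn_rewards={1: 0.25},           # e.g. bonus tied to a specific turn
)
```

A user-defined reward in this style stays a plain function over the recorded trajectory, so it could be changed without touching rollout or update code.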

Problem

Research questions and friction points this paper is trying to address.

multi-agent systems

reinforcement learning

large language models

workflow optimization

credit assignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reinforcement learning

LLM-based agents

workflow optimization