Policy-Conditioned Policies for Multi-Agent Task Solving

📅 2025-12-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The core challenge in multi-agent coordination lies in the opacity of neural policies, which hinders policy conditioning. To address this, we propose *programmatic policy representation*, explicitly encoding policies as human-readable source code and leveraging large language models (LLMs) as both program interpreters and best-response operators. We introduce the *Programmatic Iterated Best Response* (PIBR) algorithm, which jointly optimizes game utility and unit-test feedback via textual gradients, enabling interpretable and logically grounded policy conditioning in deep reinforcement learning. Evaluated on coordinated matrix games and the Level-Based Foraging benchmark, our approach significantly outperforms end-to-end multi-agent RL baselines in three key dimensions: policy interpretability, dynamic adaptability to environmental changes, and convergence stability. The programmatic formulation facilitates explicit reasoning over policy logic, supports modular verification via unit tests, and enables transparent strategic adaptation through LLM-based program synthesis and refinement.

📝 Abstract
In multi-agent tasks, the central challenge lies in the dynamic adaptation of strategies. However, directly conditioning on opponents' strategies is intractable in the prevalent deep reinforcement learning paradigm due to a fundamental "representational bottleneck": neural policies are opaque, high-dimensional parameter vectors that are incomprehensible to other agents. In this work, we propose a paradigm shift that bridges this gap by representing policies as human-interpretable source code and utilizing Large Language Models (LLMs) as approximate interpreters. This programmatic representation allows us to operationalize the game-theoretic concept of *Program Equilibrium*. We reformulate the learning problem by utilizing LLMs to perform optimization directly in the space of programmatic policies. The LLM functions as a point-wise best-response operator that iteratively synthesizes and refines the ego agent's policy code to respond to the opponent's strategy. We formalize this process as *Programmatic Iterated Best Response (PIBR)*, an algorithm where the policy code is optimized by textual gradients, using structured feedback derived from game utility and runtime unit tests. We demonstrate that this approach effectively solves several standard coordination matrix games and a cooperative Level-Based Foraging environment.
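The iterated best-response loop described in the abstract can be sketched in miniature on a 2x2 coordination game. In the paper the best-response operator is an LLM that reads and synthesizes policy source code; here it is replaced by a hand-written stub (`stub_best_response`) purely for illustration, and all names are assumptions rather than the paper's API.

```python
# PIBR sketch on a 2x2 coordination game: payoff 1 when both agents
# pick the same action, 0 otherwise. Policies are plain source code;
# an "interpreter" executes them, and a stub best-response operator
# (standing in for the LLM) emits responding source code.

PAYOFF = {(0, 0): 1, (1, 1): 1, (0, 1): 0, (1, 0): 0}

def run_policy(code: str) -> int:
    """Execute policy source code and return its chosen action."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace["act"]()

def stub_best_response(opponent_code: str) -> str:
    """Stand-in for the LLM operator: inspect the opponent's program
    and emit source code playing the utility-maximizing reply."""
    opp_action = run_policy(opponent_code)
    best = max((0, 1), key=lambda a: PAYOFF[(a, opp_action)])
    return f"def act():\n    return {best}\n"

def pibr(initial_ego: str, initial_opp: str, rounds: int = 5):
    """Alternate programmatic best responses until a program-level
    fixed point (a Program Equilibrium in this toy setting)."""
    ego, opp = initial_ego, initial_opp
    for _ in range(rounds):
        new_ego = stub_best_response(opp)
        new_opp = stub_best_response(new_ego)
        if new_ego == ego and new_opp == opp:
            break  # neither program changes: fixed point reached
        ego, opp = new_ego, new_opp
    return ego, opp

ego, opp = pibr("def act():\n    return 0\n", "def act():\n    return 1\n")
print(run_policy(ego), run_policy(opp))  # prints: 1 1 (coordinated)
```

The loop converges in two rounds here; the paper's algorithm additionally folds unit-test feedback into the refinement step, which this toy omits.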
Problem

Research questions and friction points this paper is trying to address.

Addresses dynamic strategy adaptation in multi-agent tasks
Overcomes the representational bottleneck of opaque neural policies
Uses programmatic policies and LLMs for interpretable agent optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Represent policies as human-interpretable source code
Use LLMs as approximate interpreters for policy optimization
Apply Programmatic Iterated Best Response with textual gradients
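One way to picture the "textual gradient" idea above: run unit tests against a candidate policy program and collect the failure messages as structured text that an LLM could use to refine the code. The toy foraging policy, observation encoding, and test cases below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical textual-gradient step: unit-test a candidate policy and
# turn failures into human-readable feedback for an LLM refiner.

def policy(observation):
    # Candidate policy for a toy foraging grid: move toward the food.
    # Observation encoding (assumed): (food_x, food_y, agent_x, agent_y).
    fx, fy, ax, ay = observation
    if fx > ax:
        return "right"
    if fx < ax:
        return "left"
    return "up" if fy > ay else "down"

UNIT_TESTS = [
    ((2, 0, 0, 0), "right"),  # food to the right
    ((0, 0, 2, 0), "left"),   # food to the left
    ((0, 2, 0, 0), "up"),     # food above
]

def textual_gradient(policy_fn, tests):
    """Return failure messages; an empty list means all tests pass."""
    feedback = []
    for obs, expected in tests:
        got = policy_fn(obs)
        if got != expected:
            feedback.append(f"on input {obs}: expected {expected!r}, got {got!r}")
    return feedback

fb = textual_gradient(policy, UNIT_TESTS)
print(fb or "all unit tests pass")
```

When a test fails, the resulting message (e.g. `on input (0, 2, 0, 0): expected 'up', got 'down'`) is exactly the kind of structured feedback the abstract describes feeding back into LLM-based program refinement.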