Distribution Parameter Actor-Critic: Shifting the Agent-Environment Boundary for Diverse Action Spaces

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of unified policy learning in reinforcement learning caused by ambiguous agent–environment boundaries and heterogeneous action spaces (discrete, continuous, or hybrid). We propose a distribution-parameterized modeling paradigm: the agent outputs parameters of the action distribution—not raw actions—enabling fully continuous policy optimization. To this end, we design a Distribution-Parameterized Policy Gradient (DPPG) estimator to reduce gradient variance, and introduce Interpolated Critic Learning (ICL) to stabilize critic training in the distribution-parameter space. Built upon the TD3 framework, our method integrates reparameterization, deterministic policy gradient theory, and bandit-inspired interpolated critic updates. Evaluated on MuJoCo and DeepMind Control Suite, it significantly outperforms TD3 across continuous-control benchmarks and remains highly competitive on discrete-action tasks. To our knowledge, this is the first approach achieving consistent, efficient policy learning across all action-type domains.

📝 Abstract
We introduce a novel reinforcement learning (RL) framework that treats distribution parameters as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, mixed, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distribution Parameter Policy Gradient (DPPG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce interpolated critic learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical DPPG-based actor-critic algorithm, Distribution Parameter Actor-Critic (DPAC). Empirically, DPAC outperforms TD3 in MuJoCo continuous control tasks from OpenAI Gym and DeepMind Control Suite, and demonstrates competitive performance on the same environments with discretized action spaces.
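The boundary shift described in the abstract can be made concrete with a small sketch: instead of emitting a discrete action, the agent emits a continuous parameter vector (here, logits of a categorical distribution), and the wrapped environment samples the original action itself. This is a minimal illustration of the idea, not the paper's implementation; the wrapper class, the toy environment, and the logits parameterization are all hypothetical choices for the discrete case.

```python
import numpy as np

class DistributionParameterWrapper:
    """Hypothetical wrapper sketching the paper's boundary shift:
    the agent's 'action' is a parameter vector of a distribution over
    the original actions; sampling moves inside the environment."""

    def __init__(self, env, n_actions, seed=0):
        self.env = env
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)

    def step(self, logits):
        # New action space: continuous logits, regardless of the
        # original (here: discrete) action type.
        probs = np.exp(logits - logits.max())   # stable softmax
        probs /= probs.sum()
        original_action = self.rng.choice(self.n_actions, p=probs)
        return self.env.step(original_action)

class ToyEnv:
    """Three-armed toy task: arm 2 pays off."""
    def step(self, a):
        return a, float(a == 2), False  # obs, reward, done

wrapped = DistributionParameterWrapper(ToyEnv(), n_actions=3)
obs, reward, done = wrapped.step(np.array([-10.0, -10.0, 10.0]))
```

With logits strongly concentrated on one arm, the sampled original action is that arm with near-certainty, while the agent's own action space stays fully continuous.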
Problem

Research questions and friction points this paper is trying to address.

How to draw a single agent-environment boundary that accommodates discrete, continuous, and hybrid action spaces
High variance of policy gradient estimates in the original action space
Instability of critic learning when the critic is defined over distribution parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reparameterizes actions as distribution parameters
Introduces DPPG for lower-variance gradients
Uses ICL to enhance critic learning
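Since the new action space (distribution parameters) is continuous, the DPPG actor update reduces to a deterministic policy gradient through the critic, as in TD3. The sketch below illustrates that chain rule with hypothetical linear models; the shapes, weights, and learning rate are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# DPPG-style actor update, sketched with hypothetical linear models:
# the actor deterministically maps state s -> distribution parameters w,
# and is updated via the chain rule dQ/dtheta = (dQ/dw) * (dw/dtheta),
# exactly as a deterministic policy gradient in parameter space.

state_dim, param_dim = 3, 2
rng = np.random.default_rng(1)
theta = rng.normal(size=(param_dim, state_dim)) * 0.1   # actor weights
phi = rng.normal(size=(state_dim + param_dim,)) * 0.1   # linear critic weights

def actor(s):
    # w = mu_theta(s): deterministic map to distribution parameters
    return theta @ s

def critic(s, w):
    # Q_phi(s, w): linear in (s, w) for transparency
    return phi @ np.concatenate([s, w])

def actor_grad(s):
    # For a linear critic, dQ/dw is the w-block of phi;
    # dw/dtheta is rank-one in s, so dQ/dtheta = outer(dQ/dw, s).
    dq_dw = phi[state_dim:]
    return np.outer(dq_dw, s)

s = np.array([1.0, -0.5, 0.2])
lr = 0.1
q_before = critic(s, actor(s))
theta = theta + lr * actor_grad(s)   # gradient ascent on Q
q_after = critic(s, actor(s))
```

One ascent step along this gradient increases the critic's value of the actor's output, which is the mechanism DPAC inherits from TD3, transplanted into the distribution-parameter space.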