🤖 AI Summary
This paper addresses the parameter explosion and inefficient exploration of policy-parameter space in explicit policy-conditioned value functions (EPVFs) for large-scale continuous control tasks. Methodologically, it introduces a scalable EPVF modeling framework that (i) integrates large-scale GPU-parallel simulation with large-batch training; (ii) designs an action-driven representation of policy parameters; (iii) proposes a novel neural architecture specialized for learning weight-space features; and (iv) incorporates weight clipping and scaled perturbations to stabilize training and improve generalization. Its key contribution is breaking the conventional value function's reliance on a fixed policy, enabling end-to-end gradient optimization over arbitrary policy parameters. Evaluated on a custom high-difficulty Ant environment, the method matches the performance of state-of-the-art algorithms, including PPO and SAC, demonstrating advantages in scalability, exploration efficiency, and policy generalization.
📝 Abstract
We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V(θ) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. However, EPVFs at scale struggle with unrestricted parameter growth and efficient exploration in the policy parameter space. To address these issues, we utilize massive parallelization with GPU-based simulators, large batch sizes, weight clipping, and scaled perturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy parameter representations from previous work and specialized neural network architectures that efficiently handle weight-space features, which have not previously been used in the context of DRL.
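The core EPVF idea (gradient ascent on policy parameters through a value model V(θ), stabilized by weight clipping and scaled perturbations) can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the learned value network is replaced by a hand-coded smooth surrogate `v_hat` with a known maximum, and all names, step sizes, and clipping bounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the learned value model V_hat(theta): a smooth function
# with a known maximum at theta = [1, -2]. In the actual method, the value
# model would be a neural network trained on (policy parameters, return) data.
OPT = np.array([1.0, -2.0])

def v_hat(theta):
    return -np.sum((theta - OPT) ** 2)

def grad_v_hat(theta):
    # Analytic gradient of the toy surrogate; a learned model would
    # provide this via automatic differentiation.
    return -2.0 * (theta - OPT)

def epvf_update(theta, lr=0.1, clip=5.0, noise_scale=0.01):
    """One EPVF-style step: gradient ascent on V_hat(theta), followed by
    weight clipping and a scaled perturbation for exploration."""
    theta = theta + lr * grad_v_hat(theta)   # direct gradient step on parameters
    theta = np.clip(theta, -clip, clip)      # weight clipping bounds parameter growth
    theta = theta + noise_scale * rng.standard_normal(theta.shape)  # scaled perturbation
    return theta

theta = np.zeros(2)
for _ in range(200):
    theta = epvf_update(theta)
# theta ends up near the surrogate's optimum [1, -2], up to the perturbation noise
```

Note how the policy parameters themselves are the optimization variable: no rollout gradient is needed once V(θ) is (approximately) known, which is what distinguishes EPVFs from policy-gradient baselines like PPO and SAC.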