Generative Actor-Critic with Soft Bridge Policies

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Expressive generative policies for maximum entropy (MaxEnt) reinforcement learning face two coupled problems: their marginal action densities are intractable, which forces reliance on entropy bounds or heuristic proxies, and their iterative samplers require repeated network evaluations, which raises inference and memory costs. This work proposes SoftGAC, whose actor is a structured stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space, so that an action is generated in a single sampled forward pass. The bridge lifts the MaxEnt objective to a path-wise relative entropy against a high-entropy reference process, which in the finite-step implementation reduces exactly to a sampled transition control energy. Experiments on continuous-control benchmarks show that SoftGAC matches or exceeds diffusion and flow-matching policies in return while keeping the inference latency of one-pass actors and improving the compute-return trade-off.
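To make the "relative entropy reduces to control energy" statement concrete, here is a worked identity under assumptions that this page does not confirm: Gaussian transitions with a shared isotropic noise scale $\sigma$, step size $h = 1/K$, a driftless reference process, and a hypothetical drift $u_k$. The paper's actual parameterization may differ.

```latex
% Assumed (hypothetical) transitions, both started from the same fixed base latent z_0:
%   reference:  p_k(z_{k+1} \mid z_k)    = \mathcal{N}(z_k,\; \sigma^2 h I)
%   actor:      q_k(z_{k+1} \mid z_k, s) = \mathcal{N}(z_k + h\,u_k(z_k, s),\; \sigma^2 h I)
\begin{align}
\mathrm{KL}\!\left(q_k \,\|\, p_k\right)
  &= \frac{\lVert h\,u_k(z_k, s) \rVert^2}{2\sigma^2 h}
   = \frac{h}{2\sigma^2}\,\lVert u_k(z_k, s) \rVert^2, \\
\mathrm{KL}\!\left(Q \,\|\, P\right)
  &= \mathbb{E}_{Q}\!\left[\sum_{k=0}^{K-1} \mathrm{KL}\!\left(q_k \,\|\, p_k\right)\right]
   = \mathbb{E}_{Q}\!\left[\sum_{k=0}^{K-1} \frac{h}{2\sigma^2}\,\lVert u_k(z_k, s) \rVert^2\right].
\end{align}
```

The first line is the standard KL divergence between two Gaussians with equal covariance; the second uses the chain rule for KL over Markov paths with a shared initial state. Under these assumptions, the path-wise relative entropy is exactly the expected sum of squared drifts along sampled paths, with no density evaluation of the terminal action required.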

📝 Abstract

Expressive generative policies such as diffusion and flow models are appealing for MaxEnt online reinforcement learning because of their ability to model multimodal and highly non-Gaussian action distributions. However, training effective soft generative policies faces two obstacles that often arise together. First, marginal action densities are often unavailable, so existing methods typically rely on entropy bounds, heuristic proxies, or approximations. Second, iterative shared-parameter samplers raise inference cost and require backpropagation through time over repeated network evaluations, increasing memory cost and destabilizing policy optimization. These obstacles motivate us to seek a generative policy that exposes a tractable MaxEnt objective while requiring only a single sampled actor forward pass for action generation. To this end, we propose soft generative actor-critic (SoftGAC), whose actor defines a stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space. This structured bridge allows us to lift the MaxEnt objective to an analytically tractable path-wise relative-entropy objective against a high-entropy reference process. In a practical finite-step implementation, this relative entropy reduces exactly to a sampled transition control energy and thus provides principled soft regularization. Moreover, we keep the single-pass actor lightweight by using small step-specific bridge transitions, each evaluated only once per sampled action, while maintaining a parameter budget comparable to strong actor baselines. Extensive experiments on challenging continuous-control benchmarks show that SoftGAC attains higher or competitive returns than strong generative policy baselines, including diffusion and flow-matching policies, while staying in the low-latency regime of one-pass actors and showing considerable improvements in the compute-return tradeoff.
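The bridge actor described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`init_bridge`, `sample_bridge_action`), the tiny per-step networks, the zero base latent, the Gaussian transition noise, and the driftless reference process are all assumptions made for illustration. It shows small step-specific transitions, each evaluated once per sampled action, and a control-energy term accumulated along the sampled path.

```python
import numpy as np


def init_bridge(obs_dim, act_dim, hidden, num_steps, rng):
    """Create one small, separate network per bridge step (hypothetical layout)."""
    params = []
    for _ in range(num_steps):
        params.append({
            "W_in": rng.normal(0.0, 0.1, size=(obs_dim + act_dim, hidden)),
            "b_in": np.zeros(hidden),
            "W_out": rng.normal(0.0, 0.1, size=(hidden, act_dim)),
            "b_out": np.zeros(act_dim),
        })
    return params


def sample_bridge_action(params, obs, rng, sigma=1.0):
    """Sample one action; return (action, control_energy).

    Assumes Gaussian transitions with noise scale sigma and a driftless
    reference process, so each step contributes h * ||u||^2 / (2 sigma^2).
    """
    num_steps = len(params)
    h = 1.0 / num_steps                      # step size on the unit interval
    act_dim = params[0]["W_out"].shape[1]
    z = np.zeros(act_dim)                    # fixed base latent (assumed: origin)
    energy = 0.0
    for p in params:                         # each step network is used exactly once
        feat = np.tanh(np.concatenate([obs, z]) @ p["W_in"] + p["b_in"])
        u = feat @ p["W_out"] + p["b_out"]   # drift (control) for this step
        energy += h * np.sum(u ** 2) / (2.0 * sigma ** 2)
        z = z + h * u + sigma * np.sqrt(h) * rng.standard_normal(act_dim)
    return np.tanh(z), energy                # squash the pre-tanh latent to an action


rng = np.random.default_rng(0)
bridge = init_bridge(obs_dim=3, act_dim=2, hidden=8, num_steps=4, rng=rng)
action, energy = sample_bridge_action(bridge, np.ones(3), rng)
```

In an actor-critic loop, a term like `energy` would plausibly be added, with some temperature weight, to the negative critic value as the actor loss; that weighting and the loss form are likewise assumptions, not details given here.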

Problem

Research questions and friction points this paper is trying to address.

generative policy

maximum entropy reinforcement learning

action distribution

inference cost

policy optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

generative actor-critic

soft bridge policies

maximum entropy reinforcement learning