🤖 AI Summary
This work proposes Noise-Conditioned Langevin Q-Learning (NC-LQL), a novel soft policy optimization framework for reinforcement learning that circumvents the need for explicit policy parameterization. Instead of representing a policy directly, NC-LQL samples actions from the Boltzmann distribution induced by the Q-function via Langevin dynamics. To enhance exploration and stabilize learning, the method incorporates multi-scale noise perturbations and a noise-conditioned Q-function, which together shape a progressively smoothed value landscape, enabling efficient transitions from global exploration to local refinement. Evaluated on the OpenAI Gym MuJoCo benchmark suite, NC-LQL achieves performance comparable to state-of-the-art diffusion-based approaches, offering a concise and effective alternative for online reinforcement learning that avoids both the limited expressivity of parametric policies and the entropy-estimation difficulties inherent in conventional soft-policy methods.
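To make the core idea concrete, the sampling step described above, Langevin dynamics driven by the action gradient of the Q-function, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `q_toy`, its analytic gradient, and all hyperparameters are assumptions chosen so the toy example mixes quickly.

```python
import numpy as np

def q_toy(state, action):
    # Hypothetical toy Q-function: a quadratic bowl centered at tanh(state).
    return -np.sum((action - np.tanh(state)) ** 2)

def grad_a_q(state, action):
    # Analytic action gradient of the toy Q-function above.
    return -2.0 * (action - np.tanh(state))

def langevin_sample(state, alpha=0.1, step=1e-2, n_steps=200, rng=None):
    """Sample an action from pi(a|s) ∝ exp(Q(s,a)/alpha) via Langevin dynamics.

    Each step follows the gradient of Q/alpha plus Gaussian noise scaled
    by sqrt(2*step), the standard unadjusted Langevin discretization.
    """
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(state.shape)  # random initial action
    for _ in range(n_steps):
        a = (a + step * grad_a_q(state, a) / alpha
             + np.sqrt(2.0 * step) * rng.standard_normal(a.shape))
    return a

s = np.array([0.5, -1.0])
action = langevin_sample(s)  # concentrates near the mode tanh(s)
```

For the quadratic toy Q the Boltzmann distribution is Gaussian, so the chain settles around `tanh(s)` with spread governed by the temperature `alpha`; with a learned, non-convex Q-function the same update applies but mixing is slower, which motivates the noise-conditioned extension.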
📝 Abstract
Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
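The annealed procedure in the abstract, coarse exploration on smoothed landscapes followed by refinement on sharper ones, can be sketched as below. This is a hedged illustration under stated assumptions: `q_sigma` is a hypothetical stand-in for the learned noise-conditioned Q-function (here, smoothing simply flattens a quadratic bowl), and the sigma schedule and step sizes are illustrative, not the paper's settings.

```python
import numpy as np

def q_sigma(state, action, sigma):
    # Hypothetical noise-conditioned Q: larger sigma yields a flatter
    # (smoother) landscape around the mode at tanh(state).
    return -np.sum((action - np.tanh(state)) ** 2) / (1.0 + sigma ** 2)

def grad_a_q_sigma(state, action, sigma):
    # Analytic action gradient of the smoothed toy Q above.
    return -2.0 * (action - np.tanh(state)) / (1.0 + sigma ** 2)

def nc_langevin_sample(state, sigmas=(2.0, 1.0, 0.5, 0.1), alpha=0.1,
                       step=1e-2, steps_per_level=100, rng=None):
    """Annealed Langevin sampling over a sequence of noise levels.

    Early levels (large sigma) explore a smoothed value landscape
    globally; later levels (small sigma) refine the sample locally,
    mirroring the exploration-to-refinement transition in NC-LQL.
    """
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(state.shape)
    for sigma in sigmas:  # anneal from smooth to sharp landscapes
        for _ in range(steps_per_level):
            a = (a + step * grad_a_q_sigma(state, a, sigma) / alpha
                 + np.sqrt(2.0 * step) * rng.standard_normal(a.shape))
    return a

s = np.array([0.5, -1.0])
a = nc_langevin_sample(s)
```

In the actual method the smoothed landscapes come from a Q-function trained with multi-scale noise perturbations rather than from an analytic formula; the annealing loop itself is the part this sketch aims to convey.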