🤖 AI Summary
This work proposes Noise-Conditioned Langevin Q-Learning (NC-LQL), a novel soft policy optimization framework for reinforcement learning that circumvents the need for explicit policy parameterization. Instead of representing a policy directly, NC-LQL samples actions from the Boltzmann distribution induced by the Q-function via Langevin dynamics. To enhance exploration and stabilize learning, the method incorporates multi-scale noise perturbations and a noise-conditioned Q-function, which together shape a progressively smoothed value landscape, enabling efficient transitions from global exploration to local refinement. Evaluated on the OpenAI Gym MuJoCo benchmark suite, NC-LQL achieves performance comparable to state-of-the-art diffusion-based approaches, offering a concise and effective alternative for online reinforcement learning that avoids both the limited expressivity of parametric policies and the entropy-estimation difficulties inherent in conventional soft-policy methods.
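To make the core idea concrete, the sampling step described above, Langevin dynamics driven by the action gradient of the Q-function, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `q_toy`, its analytic gradient, and all hyperparameters are assumptions chosen so the toy example mixes quickly.

```python
import numpy as np

def q_toy(state, action):
    # Hypothetical toy Q-function: a quadratic bowl centered at tanh(state).
    return -np.sum((action - np.tanh(state)) ** 2)

def grad_a_q(state, action):
    # Analytic action gradient of the toy Q-function above.
    return -2.0 * (action - np.tanh(state))

def langevin_sample(state, alpha=0.1, step=1e-2, n_steps=200, rng=None):
    """Sample an action from pi(a|s) ∝ exp(Q(s,a)/alpha) via Langevin dynamics.

    Each step follows the gradient of Q/alpha plus Gaussian noise scaled
    by sqrt(2*step), the standard unadjusted Langevin discretization.
    """
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(state.shape)  # random initial action
    for _ in range(n_steps):
        a = (a + step * grad_a_q(state, a) / alpha
             + np.sqrt(2.0 * step) * rng.standard_normal(a.shape))
    return a

s = np.array([0.5, -1.0])
action = langevin_sample(s)  # concentrates near the mode tanh(s)
```

For the quadratic toy Q the Boltzmann distribution is Gaussian, so the chain settles around `tanh(s)` with spread governed by the temperature `alpha`; with a learned, non-convex Q-function the same update applies but mixing is slower, which motivates the noise-conditioned extension.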
📝 Abstract
Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
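The annealed procedure in the abstract, coarse exploration on smoothed landscapes followed by refinement on sharper ones, can be sketched as below. This is a hedged illustration under stated assumptions: `q_sigma` is a hypothetical stand-in for the learned noise-conditioned Q-function (here, smoothing simply flattens a quadratic bowl), and the sigma schedule and step sizes are illustrative, not the paper's settings.

```python
import numpy as np

def q_sigma(state, action, sigma):
    # Hypothetical noise-conditioned Q: larger sigma yields a flatter
    # (smoother) landscape around the mode at tanh(state).
    return -np.sum((action - np.tanh(state)) ** 2) / (1.0 + sigma ** 2)

def grad_a_q_sigma(state, action, sigma):
    # Analytic action gradient of the smoothed toy Q above.
    return -2.0 * (action - np.tanh(state)) / (1.0 + sigma ** 2)

def nc_langevin_sample(state, sigmas=(2.0, 1.0, 0.5, 0.1), alpha=0.1,
                       step=1e-2, steps_per_level=100, rng=None):
    """Annealed Langevin sampling over a sequence of noise levels.

    Early levels (large sigma) explore a smoothed value landscape
    globally; later levels (small sigma) refine the sample locally,
    mirroring the exploration-to-refinement transition in NC-LQL.
    """
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(state.shape)
    for sigma in sigmas:  # anneal from smooth to sharp landscapes
        for _ in range(steps_per_level):
            a = (a + step * grad_a_q_sigma(state, a, sigma) / alpha
                 + np.sqrt(2.0 * step) * rng.standard_normal(a.shape))
    return a

s = np.array([0.5, -1.0])
a = nc_langevin_sample(s)
```

In the actual method the smoothed landscapes come from a Q-function trained with multi-scale noise perturbations rather than from an analytic formula; the annealing loop itself is the part this sketch aims to convey.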