SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses unsupervised reinforcement learning pre-training: learning an exploratory, task-agnostic prior policy from an arbitrary off-policy dataset, without access to task-specific reward signals. The authors propose SEMDICE, presented as the first method to optimize a Markov policy directly in the space of stationary state distributions, with the objective of maximizing the entropy of the induced stationary state distribution. Leveraging the DICE (stationary DIstribution Correction Estimation) framework, they derive a provably optimal dual formulation for stationary entropy maximization. The method requires neither environment interaction nor on-policy re-sampling, which improves both the accuracy of stationary entropy estimation and the exploration capability of the learned policy. Empirically, fine-tuning the pre-trained policy on downstream tasks consistently outperforms existing SEM-based unsupervised RL pre-training approaches.
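As context, here is a generic DICE-style sketch of the optimization such methods build on (a hedged reconstruction from standard occupancy-measure formulations, not necessarily the paper's exact derivation). State entropy maximization is posed directly over occupancy measures $d$, subject to the Bellman flow constraint:

$$
\max_{d \ge 0} \; -\sum_{s} d(s) \log d(s)
\quad \text{s.t.} \quad
\sum_{a} d(s, a) = (1 - \gamma)\, p_0(s) + \gamma \sum_{s', a'} P(s \mid s', a')\, d(s', a') \quad \forall s,
$$

where $d(s) = \sum_a d(s, a)$, $p_0$ is the initial state distribution, and $\gamma$ is the discount factor. The optimized occupancy yields a single Markov policy $\pi(a \mid s) = d(s, a) / d(s)$, and the dual formulation operates through correction ratios $w(s, a) = d(s, a) / d^D(s, a)$ against the dataset occupancy $d^D$, which is what enables fully off-policy training without re-sampling.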

📝 Abstract
In unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the stationary state distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset, optimizing the policy directly within the space of stationary distributions. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.
Problem

Research questions and friction points this paper is trying to address.

Maximizing the entropy of the stationary state distribution during unsupervised RL pre-training, without task-specific rewards
Learning an exploratory prior policy from arbitrary off-policy datasets, without environment interaction or on-policy re-sampling
Improving adaptation efficiency when fine-tuning on downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy state entropy maximization via stationary distribution correction estimation (DICE)
Direct policy optimization within the space of stationary state distributions
A single, stationary Markov policy computed from an arbitrary off-policy dataset (see the sketch after this list)
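To make the correction-estimation idea concrete, here is a minimal tabular sketch in Python. It assumes a dataset occupancy d^D and correction ratios w(s, a) = d^π(s, a) / d^D(s, a); in SEMDICE the ratios would come from solving the dual optimization, whereas here they are random placeholders. All variable names are illustrative, not the paper's implementation.

```python
import numpy as np

# Minimal tabular sketch (illustrative assumptions throughout): a
# 4-state, 2-action MDP with a fixed dataset occupancy d^D and
# placeholder correction ratios w.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# Empirical state-action occupancy of the offline dataset, d^D(s, a).
d_D = rng.dirichlet(np.ones(n_states * n_actions)).reshape(n_states, n_actions)

# Correction ratios w(s, a) = d^pi(s, a) / d^D(s, a). In SEMDICE these
# would be obtained from the dual optimization; here they are random.
w = rng.uniform(0.5, 1.5, size=(n_states, n_actions))

# Induced occupancy of the target policy, renormalized to a distribution.
d_pi = w * d_D
d_pi /= d_pi.sum()

# Stationary state distribution and its entropy H(d^pi).
d_pi_s = d_pi.sum(axis=1)
entropy = -np.sum(d_pi_s * np.log(d_pi_s + 1e-12))

# Recover the single Markov policy from the occupancy measure:
# pi(a | s) = d^pi(s, a) / sum over a' of d^pi(s, a').
policy = d_pi / d_pi_s[:, None]

print(f"estimated state entropy: {entropy:.3f}")
print("policy rows sum to 1:", np.allclose(policy.sum(axis=1), 1.0))
```

The last step illustrates the standard occupancy-to-policy recovery, which is what lets DICE-style methods return a single Markov policy rather than a mixture of non-stationary policies.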