SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses unsupervised reinforcement learning pre-training: learning an exploratory, task-agnostic prior policy from an arbitrary off-policy dataset, without access to task-specific reward signals. The authors propose SEMDICE, presented as the first method to optimize a Markov policy directly in the space of stationary state distributions, with the objective of maximizing the entropy of the induced stationary state distribution. Leveraging the DICE (stationary DIstribution Correction Estimation) framework, they derive a provably optimal dual formulation for stationary entropy maximization. The method requires neither environment interaction nor on-policy re-sampling, which improves both the accuracy of stationary entropy estimation and the exploration capability of the learned policy. Empirically, fine-tuning the pre-trained policy on downstream tasks consistently outperforms existing SEM-based unsupervised RL pre-training approaches.
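As context, here is a generic DICE-style sketch of the optimization such methods build on (a hedged reconstruction from standard occupancy-measure formulations, not necessarily the paper's exact derivation). State entropy maximization is posed directly over occupancy measures $d$, subject to the Bellman flow constraint:

$$
\max_{d \ge 0} \; -\sum_{s} d(s) \log d(s)
\quad \text{s.t.} \quad
\sum_{a} d(s, a) = (1 - \gamma)\, p_0(s) + \gamma \sum_{s', a'} P(s \mid s', a')\, d(s', a') \quad \forall s,
$$

where $d(s) = \sum_a d(s, a)$, $p_0$ is the initial state distribution, and $\gamma$ is the discount factor. The optimized occupancy yields a single Markov policy $\pi(a \mid s) = d(s, a) / d(s)$, and the dual formulation operates through correction ratios $w(s, a) = d(s, a) / d^D(s, a)$ against the dataset occupancy $d^D$, which is what enables fully off-policy training without re-sampling.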

📝 Abstract
In unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the stationary state distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset, optimizing the policy directly within the space of stationary distributions. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.
Problem

Research questions and friction points this paper is trying to address.

Maximizing the entropy of the stationary state distribution during unsupervised RL pre-training, without task-specific rewards
Learning an exploratory prior policy from arbitrary off-policy datasets, without environment interaction or on-policy re-sampling
Improving adaptation efficiency when fine-tuning on downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy state entropy maximization via stationary distribution correction estimation (DICE)
Direct policy optimization within the space of stationary state distributions
A single, stationary Markov policy computed from an arbitrary off-policy dataset (see the sketch after this list)
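To make the correction-estimation idea concrete, here is a minimal tabular sketch in Python. It assumes a dataset occupancy d^D and correction ratios w(s, a) = d^π(s, a) / d^D(s, a); in SEMDICE the ratios would come from solving the dual optimization, whereas here they are random placeholders. All variable names are illustrative, not the paper's implementation.

```python
import numpy as np

# Minimal tabular sketch (illustrative assumptions throughout): a
# 4-state, 2-action MDP with a fixed dataset occupancy d^D and
# placeholder correction ratios w.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# Empirical state-action occupancy of the offline dataset, d^D(s, a).
d_D = rng.dirichlet(np.ones(n_states * n_actions)).reshape(n_states, n_actions)

# Correction ratios w(s, a) = d^pi(s, a) / d^D(s, a). In SEMDICE these
# would be obtained from the dual optimization; here they are random.
w = rng.uniform(0.5, 1.5, size=(n_states, n_actions))

# Induced occupancy of the target policy, renormalized to a distribution.
d_pi = w * d_D
d_pi /= d_pi.sum()

# Stationary state distribution and its entropy H(d^pi).
d_pi_s = d_pi.sum(axis=1)
entropy = -np.sum(d_pi_s * np.log(d_pi_s + 1e-12))

# Recover the single Markov policy from the occupancy measure:
# pi(a | s) = d^pi(s, a) / sum over a' of d^pi(s, a').
policy = d_pi / d_pi_s[:, None]

print(f"estimated state entropy: {entropy:.3f}")
print("policy rows sum to 1:", np.allclose(policy.sum(axis=1), 1.0))
```

The last step illustrates the standard occupancy-to-policy recovery, which is what lets DICE-style methods return a single Markov policy rather than a mixture of non-stationary policies.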