Fat-to-Thin Policy Optimization: Offline RL with Sparse Policies

📅 2025-01-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In safety-critical offline reinforcement learning, sparse policies assign exactly zero probability to some actions. This breaks policy evaluation whenever the policy's support does not cover the actions logged by the behavior policy. To address this, we propose Fat-to-Thin Policy Optimization (FtTPO), a framework that introduces a heavy-tailed proposal policy and models both the proposal and the sparse policy within the single q-Gaussian distribution family. FtTPO transfers knowledge from the proposal policy, which has broad action support, to a sparse, safety-constrained target policy through offline policy iteration coupled with support-aware constrained optimization. Evaluated on a medical treatment simulation and the MuJoCo benchmarks, FtTPO performs favorably against existing offline RL methods while keeping the deployed policy sparse. The implementation is publicly available.

📝 Abstract

Sparse continuous policies are distributions that can choose some actions at random yet keep strictly zero probability for the other actions, which makes them radically different from Gaussian policies. They have important real-world implications, e.g. in modeling safety-critical tasks like medicine. The combination of offline reinforcement learning and sparse policies provides a novel paradigm that enables learning a safety-aware sparse policy entirely from logged datasets. However, sparse policies can cause difficulty for existing offline algorithms, which require evaluating actions that fall outside of the current support. In this paper, we propose the first offline policy optimization algorithm that tackles this challenge: Fat-to-Thin Policy Optimization (FtTPO). Specifically, we maintain a fat (heavy-tailed) proposal policy that effectively learns from the dataset and injects knowledge into a thin (sparse) policy, which is responsible for interacting with the environment. We instantiate FtTPO with the general $q$-Gaussian family that encompasses both heavy-tailed and sparse policies and verify that it performs favorably in a safety-critical treatment simulation and the standard MuJoCo suite. Our code is available at https://github.com/lingweizhu/fat2thin.
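
To make the "fat" versus "thin" distinction concrete, here is a minimal sketch of an unnormalized one-dimensional $q$-Gaussian density built from the Tsallis $q$-exponential. It is not taken from the paper or its repository: the function names `exp_q` and `q_gaussian_unnormalized`, the parameterization via `mean` and `scale`, and the example values are all assumptions chosen for illustration, and the paper's own parameterization may differ. Under this parameterization, $q < 1$ gives compact support (a sparse policy), $q = 1$ recovers the Gaussian, and $q > 1$ gives heavy tails.

```python
import numpy as np


def exp_q(x, q):
    """Tsallis q-exponential: [1 + (1 - q) x]_+ ** (1 / (1 - q)).

    Reduces to exp(x) at q = 1. Only used here with x <= 0.
    """
    x = np.asarray(x, dtype=float)
    if np.isclose(q, 1.0):
        return np.exp(x)
    base = np.maximum(1.0 + (1.0 - q) * x, 0.0)  # clip at zero: the [.]_+ operation
    return base ** (1.0 / (1.0 - q))


def q_gaussian_unnormalized(action, mean, scale, q):
    """Unnormalized q-Gaussian density (illustrative parameterization)."""
    z = (np.asarray(action, dtype=float) - mean) / scale
    return exp_q(-0.5 * z**2, q)


# A logged action two scale units away from the policy mean.
logged_action = 2.0
for name, q in [("thin (q=0)", 0.0), ("Gaussian (q=1)", 1.0), ("fat (q=2)", 2.0)]:
    density = float(q_gaussian_unnormalized(logged_action, mean=0.0, scale=1.0, q=q))
    with np.errstate(divide="ignore"):  # log(0) = -inf is the point of the demo
        log_density = np.log(density)
    print(f"{name}: density={density:.4f}, log-density={log_density:.4f}")
```

With $q = 0$ the logged action lies outside the support $|z| < \sqrt{2}$, so its density is exactly zero and its log-density is $-\infty$. Any objective that needs the log-likelihood of dataset actions under the sparse policy is therefore undefined there, whereas the heavy-tailed $q = 2$ density stays strictly positive everywhere.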

Problem

Research questions and friction points this paper is trying to address.

Sparse Continuous Policies

Evaluating Out-of-Support Actions

Safe Strategy Learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

FtTPO

Sparse Continuous Policy Optimization

q-Gaussian Family

🔎 Similar Papers

Offline Hierarchical Reinforcement Learning via Inverse Optimization