Data-Efficient Safe Policy Improvement Using Parametric Structure

📅 2025-07-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the Safe Policy Improvement (SPI) problem in offline reinforcement learning: given only a fixed dataset and the behavior policy, how can a new policy be constructed that outperforms the behavior policy with high confidence? To improve the low data efficiency of existing SPI methods while preserving their reliability guarantees, the paper proposes a parametric SPI approach that (i) exploits known parametric dependencies between distributions in the transition dynamics to estimate them more accurately from the same data; (ii) prunes redundant actions from the environment through a game-based abstraction; and (iii) identifies additional prunable actions via preprocessing based on satisfiability modulo theories (SMT) solving. Experiments show that these techniques reduce the sample complexity of safe policy improvement by multiple orders of magnitude compared to existing SPI approaches across several benchmark tasks, and an ablation study confirms the contribution of each component to both empirical performance and the safety guarantees.

πŸ“ Abstract
Safe policy improvement (SPI) is an offline reinforcement learning problem in which a new policy that reliably outperforms the behavior policy with high confidence needs to be computed using only a dataset and the behavior policy. Markov decision processes (MDPs) are the standard formalism for modeling environments in SPI. In many applications, additional information in the form of parametric dependencies between distributions in the transition dynamics is available. We make SPI more data-efficient by leveraging these dependencies through three contributions: (1) a parametric SPI algorithm that exploits known correlations between distributions to more accurately estimate the transition dynamics using the same amount of data; (2) a preprocessing technique that prunes redundant actions from the environment through a game-based abstraction; and (3) a more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, that can identify more actions to prune. Empirical results and an ablation study show that our techniques increase the data efficiency of SPI by multiple orders of magnitude while maintaining the same reliability guarantees.
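As a toy illustration of contribution (1), suppose several transitions in the MDP are known to share the same success probability p. A standard estimator fits each transition from its own samples alone, while a parametric estimator pools every sample of the shared parameter, so the Hoeffding confidence width shrinks by a factor of sqrt(k) when k transitions are tied. This is a minimal sketch of the idea, not the paper's algorithm; all names and numbers are illustrative.

```python
import math

def hoeffding_width(n, delta=0.05):
    """Half-width of a (1 - delta) Hoeffding confidence interval
    for a Bernoulli mean estimated from n samples."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

# Three transitions known to share one success probability p.
# counts[i] = (successes, trials) observed for transition i.
counts = [(18, 30), (21, 30), (17, 30)]

# Naive per-transition estimates: each uses only its own 30 samples.
naive_widths = [hoeffding_width(trials) for _, trials in counts]

# Parametric estimate: pool all samples of the shared parameter p.
total_succ = sum(s for s, _ in counts)
total_trials = sum(t for _, t in counts)
p_hat = total_succ / total_trials
pooled_width = hoeffding_width(total_trials)

print(f"pooled p_hat = {p_hat:.3f}")
print(f"naive width = {naive_widths[0]:.3f}, pooled width = {pooled_width:.3f}")
```

With the same total data budget, the pooled interval is noticeably tighter, which is the mechanism behind the data-efficiency gains the abstract claims.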
Problem

Research questions and friction points this paper is trying to address.

Improving offline reinforcement learning data efficiency
Exploiting parametric dependencies in transition dynamics
Pruning redundant actions via game and SMT techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parametric SPI algorithm leveraging distribution correlations
Game-based abstraction pruning redundant actions
SMT solving for advanced action pruning
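A minimal sketch of the pruning idea behind the second and third contributions: an action can be discarded if some other action yields at least as much value for every admissible value of the unknown transition parameter. The paper proves such dominance symbolically with an SMT solver; the stand-in below merely checks it on a dense grid of parameter values, and all action names and value functions are hypothetical.

```python
def dominated(q_a, q_b, p_lo=0.0, p_hi=1.0, steps=1000):
    """Return True if action a is weakly dominated by action b:
    q_b(p) >= q_a(p) for every parameter value p in [p_lo, p_hi].
    The grid check stands in for a symbolic SMT proof."""
    for i in range(steps + 1):
        p = p_lo + (p_hi - p_lo) * i / steps
        if q_b(p) < q_a(p):
            return False
    return True

# Toy action values as functions of an unknown transition parameter p.
q = {
    "risky": lambda p: 2 * p - 1,  # pays off only when p is large
    "safe": lambda p: 0.2,         # constant payoff
    "hedge": lambda p: p,          # at least as good as "risky" on [0, 1]
}

# Prune every action that some other action weakly dominates.
pruned = {a for a in q for b in q if b != a and dominated(q[a], q[b])}
print("pruned actions:", pruned)
```

Here only "risky" is pruned, since "hedge" matches or beats it for every p in [0, 1]; shrinking the action set this way reduces how much data the SPI algorithm needs per state.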