Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data

šŸ“… 2025-05-14
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ“„ PDF
šŸ¤– AI Summary
This paper addresses the poor policy generalization across heterogeneous populations in offline reinforcement learning. We propose the first personalized policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). Our core contributions are: (i) the formalization of a heterogeneous stationary MDP model incorporating individual-specific latent variables; and (ii) the design of P4L—a novel algorithm integrating latent-variable modeling, pessimistic Q-value estimation, penalty-based regularization, and personalized Q-function fitting. Under a weak partial coverage assumption, P4L achieves the optimal convergence rate for average regret. Theoretical analysis provides rigorous statistical guarantees. Empirical evaluation on both synthetic benchmarks and real-world healthcare datasets demonstrates that P4L significantly outperforms existing offline RL methods, yielding average individual policy reward improvements of 12.7%–23.4%. The framework thus bridges theoretical rigor with practical efficacy in personalized decision-making under heterogeneity.

Technology Category

Application Category

šŸ“ Abstract
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
Problem

Research questions and friction points this paper is trying to address.

Learning optimal policies from heterogeneous offline data
Overcoming suboptimal policies in heterogeneous populations
Estimating individual Q-functions with latent variables
Innovation

Methods, ideas, or system contributions that make the work stand out.

Individualized offline policy optimization for heterogeneous MDPs
Penalized Pessimistic Personalized Policy Learning (P4L) algorithm
Estimates individual Q-functions with latent variables
šŸ”Ž Similar Papers
No similar papers found.