🤖 AI Summary
This work addresses policy learning from multi-source, heterogeneous offline data, where distributional shift across sites hinders generalization. The authors propose an offline reinforcement learning framework built on a distributionally robust Markov decision process with a group-linear structure: all sites share a common feature map, and each site's transition kernel and expected reward function are linear in those shared features. Feature-wise (d-rectangular) uncertainty sets preserve cross-site structure and keep the robust Bellman recursion tractable, without relying on the stronger state-action rectangularity assumption. The algorithm is a pessimistic value iteration that estimates Bellman targets by per-site ridge regression, aggregates them by a feature-wise worst case, and applies a data-dependent pessimism penalty. A cluster-level extension pools similar sites, guided by prior knowledge of site similarity, to improve sample efficiency. Under a robust partial coverage assumption, the authors prove a suboptimality bound for the learned policy.
📝 Abstract
We often collect data from multiple sites (e.g., hospitals) that share common structure but also exhibit heterogeneity. This paper aims to learn robust sequential decision-making policies from such offline, multi-site datasets. To model cross-site uncertainty, we study distributionally robust MDPs with a group-linear structure: all sites share a common feature map, and both the transition kernels and expected reward functions are linear in these shared features. We introduce feature-wise (d-rectangular) uncertainty sets, which preserve tractable robust Bellman recursions while maintaining key cross-site structure. Building on this, we develop an offline algorithm based on pessimistic value iteration that includes: (i) per-site ridge regression for Bellman targets, (ii) feature-wise worst-case (row-wise minimization) aggregation, and (iii) a data-dependent pessimism penalty computed from the diagonals of the inverse design matrices. We further propose a cluster-level extension that pools similar sites to improve sample efficiency, guided by prior knowledge of site similarity. Under a robust partial coverage assumption, we prove a suboptimality bound for the resulting policy. Overall, our framework addresses multi-site learning with heterogeneous data sources and provides a principled approach to robust planning without relying on strong state-action rectangularity assumptions.
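The three components of the algorithm can be illustrated with a minimal NumPy sketch of a single backup step. This is not the authors' implementation. The function names (`site_ridge`, `pessimistic_backup`), the regularizer `lam`, the penalty scale `beta`, the choice to subtract the penalty before taking the minimum, and the requirement that the query features be nonnegative are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def site_ridge(Phi, y, lam=1.0):
    """Ridge fit for one site (hypothetical helper).

    Phi: (n, d) feature matrix, y: (n,) Bellman targets.
    Returns the coefficients and the diagonal of the inverse design matrix.
    """
    d = Phi.shape[1]
    Lambda = Phi.T @ Phi + lam * np.eye(d)  # regularized design matrix
    Lambda_inv = np.linalg.inv(Lambda)
    w = Lambda_inv @ (Phi.T @ y)
    return w, np.diag(Lambda_inv)


def pessimistic_backup(phis, targets, phi_query, lam=1.0, beta=1.0):
    """One pessimistic, feature-wise robust value estimate (illustrative only).

    phis, targets: per-site lists of feature matrices and Bellman targets.
    phi_query: (d,) feature vector, assumed nonnegative so that a
    coordinate-wise minimum is a valid worst case.
    """
    fits = [site_ridge(Phi, y, lam) for Phi, y in zip(phis, targets)]
    # Rows index features, columns index sites.
    W = np.column_stack([w for w, _ in fits])
    # Assumed penalty form: beta * sqrt(diag(Lambda^{-1})) per site and feature.
    B = np.column_stack([beta * np.sqrt(diag) for _, diag in fits])
    # Row-wise minimization: worst site for each feature coordinate.
    w_robust = (W - B).min(axis=1)
    return float(phi_query @ w_robust)


# Usage on synthetic data for two sites with different linear targets.
rng = np.random.default_rng(0)
phis = [rng.random((50, 3)) for _ in range(2)]
targets = [
    phis[0] @ np.array([1.0, 0.5, 0.2]),
    phis[1] @ np.array([0.8, 0.7, 0.1]),
]
value = pessimistic_backup(phis, targets, np.array([0.2, 0.3, 0.5]))
```

Because the query features are nonnegative and the penalty is nonnegative, the returned value is never larger than any single site's plain ridge prediction, which is the sense in which the estimate is both robust across sites and pessimistic with respect to data coverage.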