🤖 AI Summary
This paper tackles two obstacles in high-dimensional, long-horizon off-policy evaluation (OPE): the explosive variance of importance sampling and the compounding errors of learned dynamics models. It proposes a denoising-diffusion trajectory generation framework that avoids explicit environment dynamics modeling, instead guiding the denoising process with the target policy's score function to synthesize low-variance, high-fidelity trajectories. Key contributions include: (1) a behavior-policy score subtraction mechanism that mitigates over-regularization during target-policy score guidance; and (2) an end-to-end segment-stitching strategy that overcomes the variance bottleneck of diffusion models in long-horizon trajectory synthesis, with a theoretical guarantee of exponential variance reduction relative to direct long-horizon trajectory diffusion. On the D4RL and OpenAI Gym benchmarks, the approach substantially improves mean squared error, evaluation correlation, and policy regret over state-of-the-art OPE methods.
📝 Abstract
Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE introduces two technical innovations that make it advantageous for OPE: (1) it prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) it generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.
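To make the two innovations concrete, here is a minimal, self-contained sketch of the idea: a guided denoising step whose guidance term is the target-policy score minus the behavior-policy score, used inside a loop that stitches short generated segments end-to-end by conditioning each segment on the last state of the previous one. All function names, the toy linear "policies", the scalar state/action, and the placeholder noise prediction are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def behavior_score(action, state):
    # Hypothetical stand-in for grad_a log pi_b(a|s), the behavior policy score.
    return -(action - 0.1 * state)

def target_score(action, state):
    # Hypothetical stand-in for grad_a log pi_e(a|s), the target policy score.
    return -(action + 0.2 * state)

def guided_denoise_step(action, state, noise_pred, step_size=0.1, guidance_scale=1.0):
    """One denoising update: follow the model's noise prediction, plus a
    guidance term that subtracts the behavior-policy score from the
    target-policy score to avoid over-regularizing toward the behavior data."""
    guidance = target_score(action, state) - behavior_score(action, state)
    return action - step_size * noise_pred + step_size * guidance_scale * guidance

def stitch_trajectories(initial_state, n_segments, seg_len, step_fn, denoise_steps=10):
    """Build a long trajectory by generating short segments and stitching
    them end-to-end: each new segment is conditioned on the final state
    of the previous segment."""
    states = [initial_state]
    for _ in range(n_segments):
        s = states[-1]                 # condition on the previous segment's end state
        for _ in range(seg_len):
            a = rng.normal()           # start each action from noise
            for _ in range(denoise_steps):
                noise_pred = a         # placeholder for the diffusion model's output
                a = step_fn(a, s, noise_pred)
            s = s + 0.05 * a           # toy transition; the real method diffuses
            states.append(s)           # whole state-action segments jointly
    return states

traj = stitch_trajectories(0.0, n_segments=4, seg_len=5, step_fn=guided_denoise_step)
```

The returned trajectory would feed a Monte Carlo estimate of the target policy's value; the guidance-scale hyperparameter trades fidelity to the behavior data against adherence to the target policy.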