STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the dual challenges of exploding variance in importance sampling and accumulating dynamics-modeling errors in high-dimensional, long-horizon off-policy evaluation (OPE), this paper proposes a denoising-diffusion-based trajectory generation framework. The method avoids explicit environment dynamics modeling and instead uses policy-score-guided diffusion to synthesize low-variance, high-fidelity trajectories. Key contributions: (1) a behavior-policy score-subtraction mechanism that mitigates over-regularization during target-policy score guidance; and (2) an end-to-end segmented trajectory-stitching strategy that overcomes the variance bottleneck of diffusion models in long-horizon trajectory synthesis. Evaluated on the D4RL and OpenAI Gym benchmarks, the approach substantially reduces mean squared error, improves evaluation correlation and policy regret, and achieves an exponential variance reduction relative to long-horizon trajectory diffusion.

📝 Abstract
Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE proposes two technical innovations that make it advantageous for OPE: (1) prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.
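The guided denoising update described in the abstract can be sketched roughly as follows. This is an illustrative assumption of how the guidance term might combine the two scores, not the paper's implementation; the function name, signature, and guidance weights (`alpha`, `beta`) are hypothetical:

```python
import numpy as np

def guided_denoise_step(x_t, eps_pred, target_score, behavior_score,
                        alpha=1.0, beta=0.3):
    """One guided denoising update (illustrative sketch).

    x_t            : current noisy trajectory segment
    eps_pred       : noise estimate from the pre-trained diffusion model
    target_score   : gradient of log pi_target w.r.t. the actions in x_t
    behavior_score : gradient of log pi_behavior w.r.t. the actions in x_t
    alpha, beta    : hypothetical guidance weights
    """
    # Steer toward the target policy while subtracting the behavior-policy
    # score, so guidance is not over-regularized toward the behavior data.
    guidance = alpha * target_score - beta * behavior_score
    return x_t - eps_pred + guidance
```

The key idea is only the sign structure of `guidance`: adding the target-policy score and subtracting the behavior-policy score, since the pre-trained model already encodes the behavior distribution.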
Problem

Research questions and friction points this paper is trying to address.

Addresses high-dimensional long-horizon off-policy evaluation challenges
Reduces variance and errors in synthetic trajectory generation
Improves accuracy over existing OPE methods in robotics and healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses denoising diffusion for trajectory generation
Subtracts behavior policy score to prevent over-regularization
Stitches partial trajectories end-to-end for long-horizon generation
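The stitching idea in the last bullet can be sketched as a simple loop: generate a short segment, condition the next segment on the segment's final state, and concatenate until the horizon is reached. The sampler interface (`generate_segment`) is a hypothetical stand-in for the guided diffusion model, not the paper's API:

```python
import numpy as np

def stitch_trajectories(generate_segment, s0, horizon):
    """Assemble a long trajectory from short diffusion-generated segments.

    generate_segment(s) -> array of shape (segment_len, state_dim):
        hypothetical sampler that denoises a short trajectory segment
        conditioned on start state s.
    """
    segments, s, steps = [], s0, 0
    while steps < horizon:
        seg = generate_segment(s)
        segments.append(seg)
        s = seg[-1]          # condition the next segment on the last state
        steps += len(seg)
    # Concatenate end-to-end and trim any overshoot past the horizon.
    return np.concatenate(segments, axis=0)[:horizon]
```

Generating many short segments rather than one long rollout is what keeps per-segment generation variance low; the stitching loop only has to pass the boundary state forward.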