🤖 AI Summary
Offline reinforcement learning (RL) faces two key challenges when deploying policies in real-world environments: (1) latency discrepancies between simulation and reality—violating the Markov assumption—and (2) distributional shift between offline training and online interaction. This paper introduces DT-CORL, the first delay-robust Transformer-based belief policy framework capable of generalizing to dynamic latency environments without requiring latency-labeled data. Its core contributions are: (1) a Transformer-based latent-state belief model that explicitly captures history dependencies induced by latency; (2) constrained offline RL optimization to ensure safety under out-of-distribution states; and (3) a latency-aware action generation mechanism. Evaluated on a novel multi-latency D4RL benchmark, DT-CORL substantially outperforms existing latency-augmentation methods and conventional belief models, achieving significantly improved sample efficiency and effectively bridging the sim-to-real latency gap.
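The summary notes that latency breaks the Markov assumption: under an observation delay, the latest observation no longer describes the current state, so the agent must infer a belief over the true state from its history. A minimal sketch of this effect, using a hypothetical toy environment and wrapper (illustrative only, not the paper's implementation):

```python
from collections import deque


class SimpleChainEnv:
    """Hypothetical toy environment: the state is an integer position."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action in {-1, 0, +1}
        self.state += action
        return self.state, float(self.state), False


class DelayedObservationWrapper:
    """Illustrative delayed-observation wrapper (not DT-CORL itself).

    The agent receives observations `delay` steps old, so the current
    observation is stale: the Markov assumption breaks, and recovering
    the true state requires reasoning over the action/observation
    history -- the role played by a belief model such as DT-CORL's
    transformer-based predictor.
    """

    def __init__(self, env, delay):
        self.env = env
        self.delay = delay
        self.buffer = deque()

    def reset(self):
        obs = self.env.reset()
        # Pre-fill so the first `delay` steps repeat the initial
        # observation (a common convention for delayed MDPs).
        self.buffer = deque([obs] * (self.delay + 1),
                            maxlen=self.delay + 1)
        return self.buffer[0]

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.buffer.append(obs)              # newest observation enters
        return self.buffer[0], reward, done  # agent sees the oldest one


env = DelayedObservationWrapper(SimpleChainEnv(), delay=2)
obs = env.reset()
obs1, _, _ = env.step(1)  # true state is 1, agent still sees 0
obs2, _, _ = env.step(1)  # true state is 2, agent still sees 0
obs3, _, _ = env.step(1)  # true state is 3, agent sees the state from 2 steps ago
```

With a 2-step delay, the wrapper returns `0, 0, 1` for the three steps even though the true state is `1, 2, 3`: identical observations correspond to different underlying states, which is exactly the non-Markovian ambiguity a belief model must resolve.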
📝 Abstract
Offline-to-online deployment of reinforcement-learning (RL) agents must bridge two gaps: (1) the sim-to-real gap, where real systems add latency and other imperfections not present in simulation, and (2) the interaction gap, where policies trained purely offline face out-of-distribution states during online execution because gathering new interaction data is costly or risky. Agents therefore have to generalize from static, delay-free datasets to dynamic, delay-prone environments. Standard offline RL learns from delay-free logs yet must act under delays that break the Markov assumption and hurt performance. We introduce DT-CORL (Delay-Transformer belief policy Constrained Offline RL), an offline-RL framework built to cope with delayed dynamics at deployment. DT-CORL (i) produces delay-robust actions with a transformer-based belief predictor even though it never sees delayed observations during training, and (ii) is markedly more sample-efficient than naïve history-augmentation baselines. Experiments on D4RL benchmarks with several delay settings show that DT-CORL consistently outperforms both history-augmentation and vanilla belief-based methods, narrowing the sim-to-real latency gap while preserving data efficiency.