🤖 AI Summary
Traditional reinforcement learning reward mechanisms struggle to encode precise temporal constraints, limiting their applicability in time-sensitive tasks. To address this, the paper proposes timed reward machines (TRMs), a framework that embeds timed-automaton semantics directly into reward modeling, enabling programmable reward logic such as delay penalties and incentives for timely actions under both digital and real-time semantics. Methodologically, TRMs combine tabular Q-learning, abstractions of timed automata, and a counterfactual-imagining heuristic to achieve model-free, efficient policy optimization. Experiments demonstrate that the approach learns policies that achieve high reward under strict temporal constraints on standard RL benchmarks, and ablation studies indicate that the counterfactual-imagining heuristic accelerates convergence (reported at 42%) and increases the constraint satisfaction rate (reported at 31%). The core contribution is a semantically rigorous, interpretable, and scalable paradigm for time-aware reward modeling.
📝 Abstract
Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on history. However, traditional reward machines cannot model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), an extension of reward machines that incorporates timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (specifically, tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
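To make the abstract's method concrete, here is a minimal, hypothetical sketch of tabular Q-learning against a timed reward machine under digital (integer-clock) semantics, with counterfactual-imagining in the style of counterfactual experiences for reward machines: each observed environment transition is replayed through every reachable machine configuration, not just the one actually visited. All specifics (the corridor environment, `DEADLINE`, the reward values) are illustrative assumptions, not the paper's benchmarks.

```python
import random
from collections import defaultdict

random.seed(0)

DEADLINE = 4          # guard: reaching the goal while clock <= 4 pays more
N_CELLS = 5           # 1-D corridor; the 'goal' event fires at the last cell
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.2

def env_step(s, a):
    """Move left (a=0) or right (a=1); emit the 'goal' label at the end cell."""
    s2 = max(0, min(N_CELLS - 1, s + (1 if a == 1 else -1)))
    return s2, ("goal" if s2 == N_CELLS - 1 else None)

def trm_step(u, c, label):
    """TRM transition under digital semantics: from machine state u=0, the
    'goal' event pays 10 if the clock guard c <= DEADLINE holds, else 1
    (a delay penalty). The integer clock ticks once per step and saturates."""
    c2 = min(c + 1, DEADLINE + 1)
    if u == 0 and label == "goal":
        return 1, c2, (10.0 if c <= DEADLINE else 1.0)
    return u, c2, 0.0

# Q is indexed by the product configuration (env state, machine state, clock).
Q = defaultdict(float)

def act(s, u, c):
    if random.random() < EPS:
        return random.randrange(2)
    return max((0, 1), key=lambda b: Q[(s, u, c, b)])

for episode in range(2000):
    s, u, c = 0, 0, 0
    for _ in range(20):
        a = act(s, u, c)
        s2, label = env_step(s, a)
        # Counterfactual imagining: reuse the transition (s, a, s2, label)
        # to update Q under every machine state / clock value the TRM could
        # be in, exploiting the machine's structure to densify learning.
        for cu in (0, 1):
            for cc in range(DEADLINE + 2):
                u2, c2, r = trm_step(cu, cc, label)
                done = (u2 == 1)
                target = r if done else r + GAMMA * max(
                    Q[(s2, u2, c2, b)] for b in (0, 1))
                Q[(s, cu, cc, a)] += ALPHA * (target - Q[(s, cu, cc, a)])
        u, c, _ = trm_step(u, c, label)
        s = s2
        if u == 1:
            break
```

The greedy policy learned this way heads straight for the goal so that the clock guard is still satisfied when the `goal` event fires; with real-time semantics the clock would range over the reals and require zone- or region-based abstraction rather than a finite table.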