About Time: Model-free Reinforcement Learning with Timed Reward Machines

📅 2025-12-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional reinforcement-learning reward mechanisms struggle to encode precise timing constraints, limiting their applicability in time-sensitive tasks. To address this, the paper proposes the timed reward machine (TRM), an extension of reward machines that embeds timed-automaton semantics directly into the reward structure, enabling tunable reward logic such as costs for delays and rewards for timely actions under both digital and real-time semantics. Methodologically, TRMs are integrated into model-free tabular Q-learning via abstractions of timed automata, together with counterfactual-imagining heuristics that exploit the TRM's structure to improve the search. Experiments on standard RL benchmarks show that the learned policies achieve high reward while satisfying strict timing constraints; ablation studies indicate that the counterfactual heuristic accelerates convergence by 42% and raises the constraint-satisfaction rate by 31%. The core contribution is a semantically rigorous, interpretable, and scalable paradigm for time-aware reward modeling.

📝 Abstract
Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
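To make the abstract's "tunable reward logic" concrete, here is a minimal sketch of one TRM transition under digital-clock semantics: a labeled edge fires only when an integer clock guard holds, emits a reward, and may reset the clock. All names (`TrmEdge`, `step`, the delivery example) are illustrative assumptions, not the paper's formal definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrmEdge:
    """One TRM transition: fire on `label` when the clock guard holds,
    emit `reward`, move to `target`, and optionally reset the clock.
    Hypothetical encoding; field names are illustrative."""
    label: str
    guard: tuple          # (low, high) inclusive integer clock bounds
    reward: float
    target: str
    reset: bool = False

# "Deliver within 3 ticks for +10; a late delivery costs 5."
edges = {
    "u0": [
        TrmEdge("deliver", (0, 3), +10.0, "done"),
        TrmEdge("deliver", (4, 10**9), -5.0, "done"),
    ],
    "done": [],  # accepting sink
}

def step(u, clock, label):
    """Digital semantics: the clock advances one tick per environment
    step; take the first edge whose guard contains the clock value."""
    for e in edges[u]:
        lo, hi = e.guard
        if e.label == label and lo <= clock <= hi:
            return e.target, e.reward, (0 if e.reset else clock)
    return u, 0.0, clock  # no matching edge: stay put, zero reward
```

Under this encoding, `step("u0", 2, "deliver")` pays +10 while `step("u0", 7, "deliver")` pays -5, which is exactly the "costs for delays, rewards for timely actions" pattern the abstract describes.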
Problem

Research questions and friction points this paper is trying to address.

Extends reward machines to incorporate precise timing constraints.
Enables expressive specifications with tunable reward logic for time-sensitive applications.
Studies model-free RL frameworks to learn optimal policies under timing constraints.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Timed reward machines extend reward machines with timing constraints.
Model-free RL algorithms integrate TRMs via timed automata abstractions.
Counterfactual-imagining heuristics exploit TRM structure to improve search.
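The three innovations above can be sketched together as tabular Q-learning over the product state (environment state, TRM state, clock), where each real environment transition also updates the Q-values of every other TRM state (the counterfactual-imagining idea). Everything here is a toy assumption for illustration (`LineEnv`, the `TRM` table, the digital clock that simply counts steps); the paper's actual algorithms additionally use timed-automaton abstractions to keep the clock space finite.

```python
import random
from collections import defaultdict

class LineEnv:
    """Toy 1-D grid: positions 0..4, start at 0; the label "goal" is
    emitted on entering position 3. Hypothetical, for illustration."""
    actions = ("R", "L")

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, a):
        self.pos = min(self.pos + 1, 4) if a == "R" else max(self.pos - 1, 0)
        self.t += 1
        label = "goal" if self.pos == 3 else None
        return self.pos, label, (label == "goal" or self.t >= 20)

# Digital-clock TRM: reaching "goal" within 5 ticks pays +1, later -1.
TRM = {
    "u0": {"goal": lambda c: ("acc", 1.0 if c <= 5 else -1.0)},
    "acc": {},  # accepting sink
}

def trm_step(u, clock, label):
    if label in TRM[u]:
        return TRM[u][label](clock)
    return u, 0.0

def q_learning(env, episodes=2000, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning over the product state (env state, TRM state,
    clock). Counterfactual imagining: each real transition also updates
    the Q-values associated with every other TRM state."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, u, clock = env.reset(), "u0", 0
        done = False
        while not done:
            a = (random.choice(env.actions) if random.random() < eps
                 else max(env.actions, key=lambda b: Q[(s, u, clock, b)]))
            s2, label, done = env.step(a)
            clock2 = clock + 1
            for v in TRM:  # counterfactual sweep over TRM states
                v2, r = trm_step(v, clock2, label)
                best = max(Q[(s2, v2, clock2, b)] for b in env.actions)
                target = r + gamma * best * (not done)
                Q[(s, v, clock, a)] += alpha * (target - Q[(s, v, clock, a)])
            u, _ = trm_step(u, clock2, label)
            s, clock = s2, clock2
    return Q
```

Because the clock is part of the learned state, the agent can prefer the short path that satisfies the guard `c <= 5` over longer paths that would trigger the lateness penalty; the counterfactual sweep reuses each environment sample for all TRM states at once, which is the structure-exploiting speedup the bullets refer to.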