Reinforcement Learning for Long-Horizon Unordered Tasks: From Boolean to Coupled Reward Machines

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional reward machines (RMs) scale poorly on non-Markovian reinforcement learning tasks with long horizons and arbitrarily ordered subtasks, because their state space grows exponentially with the number of subtasks. Method: The paper introduces three generalisations of RMs: Numeric RMs, which express complex tasks in compact form; Agenda RMs, whose states carry an agenda of the remaining subtasks; and Coupled RMs, which attach coupled states to each subtask in the agenda. It also proposes CoRM, a compositional Q-learning algorithm that exploits the coupled-RM structure. Contribution/Results: Experiments show that CoRM scales better than state-of-the-art RM-based methods on long-horizon problems with unordered subtasks, improving sample efficiency and mitigating state-space explosion.

📝 Abstract
Reward machines (RMs) inform reinforcement learning agents about the reward structure of the environment. This is particularly advantageous for complex non-Markovian tasks because agents with access to RMs can learn more efficiently from fewer samples. However, learning with RMs is ill-suited for long-horizon problems in which a set of subtasks can be executed in any order. In such cases, the amount of information to learn increases exponentially with the number of unordered subtasks. In this work, we address this limitation by introducing three generalisations of RMs: (1) Numeric RMs allow users to express complex tasks in a compact form. (2) In Agenda RMs, states are associated with an agenda that tracks the remaining subtasks to complete. (3) Coupled RMs have coupled states associated with each subtask in the agenda. Furthermore, we introduce a new compositional learning algorithm that leverages coupled RMs: Q-learning with coupled RMs (CoRM). Our experiments show that CoRM scales better than state-of-the-art RM algorithms for long-horizon problems with unordered subtasks.
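The abstract's core complaint can be made concrete with a small sketch. The code below is an illustration assumed from the abstract, not the paper's implementation: a classic Boolean RM for n unordered subtasks needs one state per subset of completed subtasks (2**n states), whereas an agenda-style representation simply tracks the set of remaining subtasks. The function names `boolean_rm_states` and `agenda_step` are hypothetical.

```python
from itertools import chain, combinations

def boolean_rm_states(subtasks):
    """Enumerate classic Boolean-RM states: one per subset of
    completed subtasks, i.e. 2**n states for n subtasks."""
    s = list(subtasks)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def agenda_step(agenda, event):
    """Agenda-style update: drop a completed subtask from the agenda
    of remaining subtasks; emit reward 1 once the agenda is empty."""
    agenda = agenda - {event}
    reward = 1.0 if not agenda else 0.0
    return agenda, reward
```

With three subtasks the Boolean RM already has 8 states, while the agenda is a single set that shrinks monotonically, which is the compactness the Agenda-RM generalisation aims for.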
Problem

Research questions and friction points this paper is trying to address.

Addressing exponential learning complexity in unordered long-horizon reinforcement learning tasks
Overcoming limitations of reward machines for non-Markovian subtask sequencing
Enabling efficient scaling of reward structures for complex unordered objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Numeric RMs compactly express complex task structures
Agenda RMs track remaining subtasks using agendas
Coupled RMs attach coupled states to each subtask in the agenda
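One plausible way to read the compositional idea behind CoRM, sketched loosely below (this is an assumption based on the abstract, not the paper's exact update rule): keep a small Q-table per subtask still on the agenda and act greedily on their sum, rather than learning one monolithic Q-table over the exponential product state space. All names here (`make_q_tables`, `greedy_action`) are hypothetical.

```python
from collections import defaultdict

def make_q_tables(subtasks, actions):
    """One independent Q-table per subtask, keyed by (env_state, action)."""
    return {t: defaultdict(float) for t in subtasks}

def greedy_action(q_tables, agenda, env_state, actions):
    """Pick the action maximising the summed per-subtask Q-values
    over the subtasks still on the agenda."""
    def score(a):
        return sum(q_tables[t][(env_state, a)] for t in agenda)
    return max(actions, key=score)
```

The point of the sketch is the scaling argument: the learned tables grow linearly in the number of subtasks, whereas a Q-table over Boolean-RM product states grows with 2**n.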