🤖 AI Summary
Reward design in reinforcement learning (RL) is highly sensitive to specification and generalizes poorly across tasks, which severely limits practical deployment. This paper introduces ARM-FM (Automated Reward Machines via Foundation Models), an automated reward-design framework that integrates foundation models (FMs) with reward machines (RMs), finite-state automata that encode reward logic. The method translates natural language task descriptions end-to-end into structured RMs, and attaches a language embedding to each RM state to enable cross-task semantic alignment and zero-shot reward transfer. Unlike manual reward engineering or supervised reward modeling, ARM-FM substantially improves task success rates across multiple challenging sparse-reward environments (+32.7% on average) without task-specific fine-tuning. Its core contributions are: (i) language-driven, automatic synthesis of reward structure; (ii) a semantic-embedding-based mechanism for generalizing across RM states; and (iii) empirical evidence of effectiveness in diverse environments, including zero-shot generalization.
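The state-embedding mechanism the summary describes can be sketched as follows: each RM state carries a natural-language description of its subgoal, and the policy is conditioned on that description's embedding, so semantically similar subgoals in new tasks map to familiar policy inputs. This is an illustrative sketch, not the paper's implementation; `embed`, the dimensions, and the state names are all assumptions, with a toy deterministic encoder standing in for a real sentence encoder.

```python
import numpy as np

def embed(text, dim=8):
    # Stand-in for a real sentence encoder (illustrative only):
    # a deterministic pseudo-embedding derived from the text's bytes.
    rng = np.random.default_rng(sum(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Each RM state carries a natural-language description of its subgoal.
state_descriptions = {
    "u0": "pick up the key",
    "u1": "open the door with the key",
}
state_embeddings = {s: embed(d) for s, d in state_descriptions.items()}

def policy_input(env_obs, rm_state):
    # The policy sees the environment observation concatenated with the
    # embedding of the current RM state; a policy trained this way can be
    # reused when a new task's RM state has a similar description.
    return np.concatenate([env_obs, state_embeddings[rm_state]])

x = policy_input(np.zeros(4), "u0")
assert x.shape == (12,)  # 4 obs dims + 8 embedding dims
```

The design choice here is that generalization lives in the embedding space rather than in discrete state identities: two tasks never seen together can still share behavior if their subgoal descriptions embed nearby.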
📝 Abstract
Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- serve as the mechanism for RL objective specification and are constructed automatically by FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specification in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automaton state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
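To make the RM formalism concrete, here is a minimal sketch of a reward machine: a finite-state automaton whose transitions fire on high-level propositions observed in the environment and emit rewards, decomposing a sparse task into subgoals. All names and reward values are illustrative assumptions, not the paper's implementation.

```python
class RewardMachine:
    """Minimal reward machine: states, proposition-triggered transitions,
    and per-transition rewards (illustrative sketch)."""

    def __init__(self, initial_state, transitions, terminal_states):
        # transitions maps (state, proposition) -> (next_state, reward)
        self.state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states

    def step(self, true_props):
        """Advance on the set of propositions true at this env step."""
        for p in true_props:
            key = (self.state, p)
            if key in self.transitions:
                self.state, reward = self.transitions[key]
                return reward
        return 0.0  # no matching transition: stay put, no reward

    @property
    def done(self):
        return self.state in self.terminal_states

# Example task: "pick up the key, then open the door"
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "got_key"): ("u1", 0.1),      # subgoal reward
        ("u1", "door_open"): ("u_acc", 1.0),  # task completion
    },
    terminal_states={"u_acc"},
)
assert rm.step({"got_key"}) == 0.1   # subgoal reached
assert rm.step({"door_open"}) == 1.0  # task complete
assert rm.done
```

In the framework described above, an FM would generate a structure like this directly from the natural-language task description, rather than a human writing it by hand.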