Boosting Universal LLM Reward Design through the Heuristic Reward Observation Space Evolution

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to dynamically optimize the Reward Observation Space (ROS) by leveraging historical exploration data and natural-language task descriptions, limiting the potential of Large Language Models (LLMs) in automated reward design. To address this, we propose ROS-LLM: a novel framework for LLM-driven ROS construction. First, it introduces a state-execution table–based ROS evolution mechanism, explicitly relaxing the Markov assumption inherent in standard LLM dialogue. Second, it designs a text-code co-alignment strategy to semantically unify user intent with expert-defined success criteria. Third, it integrates tabular exploration caching, structured prompt engineering, and a bidirectional verification mechanism to enable closed-loop, iterative ROS refinement. Evaluated across multiple reinforcement learning benchmarks, ROS-LLM significantly improves reward function generalizability and training stability. The implementation—including source code and demonstration videos—is publicly available.

📝 Abstract
Large Language Models (LLMs) are emerging as promising tools for automated reinforcement learning (RL) reward design, owing to their robust capabilities in commonsense reasoning and code generation. By engaging in dialogues with RL agents, LLMs construct a Reward Observation Space (ROS) by selecting relevant environment states and defining their internal operations. However, existing frameworks have not effectively leveraged historical exploration data or manual task descriptions to iteratively evolve this space. In this paper, we propose a novel heuristic framework that enhances LLM-driven reward design by evolving the ROS through a table-based exploration caching mechanism and a text-code reconciliation strategy. Our framework introduces a state execution table, which tracks the historical usage and success rates of environment states, overcoming the Markovian constraint typically found in LLM dialogues and facilitating more effective exploration. Furthermore, we reconcile user-provided task descriptions with expert-defined success criteria using structured prompts, ensuring alignment in reward design objectives. Comprehensive evaluations on benchmark RL tasks demonstrate the effectiveness and stability of the proposed framework. Code and video demos are available at jingjjjjjie.github.io/LLM2Reward.
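The abstract's text-code reconciliation idea — combining the user's natural-language task description, the expert-defined success criteria, and the cached exploration history into one structured prompt — can be sketched roughly as follows. This is a hypothetical illustration; the function name, section headings, and table format are assumptions, not the authors' implementation:

```python
def build_reward_prompt(task_description, success_criteria, state_table_summary):
    """Hypothetical structured prompt: reconciles the user's task description
    with expert success criteria and surfaces cached exploration history so
    the LLM can iteratively evolve the Reward Observation Space (ROS)."""
    history_lines = "\n".join(
        f"- {state}: used {count} times, success rate {rate:.0%}"
        for state, (count, rate) in state_table_summary.items()
    )
    return (
        "## Task description (user)\n"
        f"{task_description}\n\n"
        "## Success criteria (expert)\n"
        f"{success_criteria}\n\n"
        "## State execution history (usage count, success rate)\n"
        f"{history_lines}\n\n"
        "Select observation states and write a reward function."
    )

# Toy example with invented state names and statistics
prompt = build_reward_prompt(
    "Pick up the cube and place it in the bin.",
    "Cube within 2 cm of bin center at episode end.",
    {"gripper_pos": (2, 0.5), "object_pos": (1, 1.0)},
)
print(prompt)
```

Injecting the history section is what carries information across dialogue turns, which is how the framework relaxes the Markovian constraint of a plain LLM conversation.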
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM-driven reward design via heuristic ROS evolution
Overcoming Markovian constraints in LLM dialogues for RL
Aligning user task descriptions with expert success criteria
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolving Reward Observation Space via exploration caching
State execution table tracks historical usage and success rates
Reconcile task descriptions with expert criteria
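The state execution table described above can be sketched as a small data structure that logs, per environment state, how often it was selected for the ROS and how often that choice led to a successful policy. This is a minimal sketch under assumed names and fields, not the authors' implementation:

```python
from collections import defaultdict

class StateExecutionTable:
    """Hypothetical sketch: per-state counters of how often a state was
    included in the Reward Observation Space and how often training with
    that reward met the success criterion."""

    def __init__(self):
        self.usage = defaultdict(int)      # times a state was selected
        self.successes = defaultdict(int)  # times that selection succeeded

    def record(self, selected_states, success):
        """Log one reward-design iteration."""
        for s in selected_states:
            self.usage[s] += 1
            if success:
                self.successes[s] += 1

    def success_rate(self, state):
        """Empirical success rate of a state; 0.0 if never tried."""
        n = self.usage[state]
        return self.successes[state] / n if n else 0.0

    def summary(self):
        """Compact (count, rate) view, suitable for the next LLM prompt."""
        return {s: (self.usage[s], round(self.success_rate(s), 2))
                for s in sorted(self.usage)}

# Two toy design iterations over invented state names
table = StateExecutionTable()
table.record(["gripper_pos", "object_pos"], success=True)
table.record(["gripper_pos", "joint_vel"], success=False)
print(table.summary())
```

Feeding `summary()` back into each prompt is the caching mechanism: it lets later iterations exploit earlier exploration instead of treating every dialogue turn as memoryless.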
Zen Kit Heng
Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing 100871, China, also with PKU-Agibot Lab, School of Computer Science, Peking University, Beijing 100871, China, and also with National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing 100871, China
Zimeng Zhao
School of Automation, Southeast University, Nanjing, China
Tianhao Wu
Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing 100871, China, also with PKU-Agibot Lab, School of Computer Science, Peking University, Beijing 100871, China, and also with National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing 100871, China
Yuanfei Wang
Peking University
Robot learning, reinforcement learning
Mingdong Wu
Peking University
Embodied AI, Reinforcement Learning, Generative Model
Yangang Wang
Professor, Southeast University
Computer graphics, Computer vision, Computational photography
Hao Dong
Center on Frontiers of Computing Studies, School of Computer Science, Peking University, Beijing 100871, China, also with PKU-Agibot Lab, School of Computer Science, Peking University, Beijing 100871, China, and also with National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing 100871, China