Reward Prediction with Factorized World States

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional supervised reward models are highly susceptible to data bias and struggle to generalize across novel goals and environments. This work proposes StateFactory, a method that leverages large language models to transform raw observations into structured, hierarchical object-attribute state representations. Reward prediction is then performed by measuring semantic similarity between the current state and the goal state, enabling zero-shot cross-domain generalization using only factorized world states. Evaluated on the new RewardPrediction benchmark, StateFactory reduces EPIC distance by 60% compared to VLWM-critic and by 8% relative to LLM-as-a-Judge. Furthermore, it achieves absolute improvements of 21.64% and 12.40% in task planning success rates on AlfWorld and ScienceWorld, respectively.

📝 Abstract
Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the goal is to being reached. Supervised learning of reward models can introduce biases inherent in the training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To this end, we introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under a hierarchical constraint. Overall, the compact representation structure induced by StateFactory enables strong reward generalization. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. Our method shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models, achieving 60% and 8% lower EPIC distance, respectively. This superior reward quality also translates into improved agent planning, yielding success-rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies and enhancing system-2 agent planning. Project Page: https://statefactory.github.io
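To make the core idea concrete, here is a minimal, hypothetical sketch of reward prediction over factorized states. The state layout (object name mapped to an attribute dictionary) follows the hierarchical object-attribute structure described in the abstract; the `factorized_reward` function name and the exact-match scoring are illustrative assumptions, standing in for the semantic similarity the paper actually uses.

```python
def factorized_reward(current, goal):
    """Fraction of goal object-attribute pairs satisfied in `current`.

    Both states map object name -> {attribute: value}. A goal pair
    counts as satisfied only if the object is present and the attribute
    value matches, reflecting the hierarchical (object-before-attribute)
    constraint. Exact matching is a simplification of the paper's
    semantic-similarity scoring.
    """
    total = satisfied = 0
    for obj, attrs in goal.items():
        for attr, value in attrs.items():
            total += 1
            if current.get(obj, {}).get(attr) == value:
                satisfied += 1
    return satisfied / total if total else 0.0


current = {"mug": {"location": "table", "clean": False}}
goal = {"mug": {"location": "sink", "clean": True}}
print(factorized_reward(current, goal))  # 0.0

current["mug"]["location"] = "sink"
print(factorized_reward(current, goal))  # 0.5
```

Because the reward depends only on the structured state and goal, not on any learned reward head, the same scoring function transfers across domains unchanged, which is the intuition behind the zero-shot generalization claim.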

Problem

Research questions and friction points this paper is trying to address.

reward prediction
world state representation
generalization
zero-shot learning
structured representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

factorized world states
reward prediction
structured representation
zero-shot generalization
hierarchical object-attribute