🤖 AI Summary
In partially observable environments, incomplete observations cause an agent's beliefs to drift, and delayed rewards make temporal credit assignment harder in long-horizon tasks. To address both problems, this work proposes ReBel, an algorithm that shifts reward attribution from actions to structured belief states. ReBel turns inconsistencies between belief predictions and observed feedback into dense self-supervised signals (belief-consistency supervision) and adds a belief-aware trajectory grouping mechanism, so that process-level learning requires no external step-wise annotations or verifiers. On long-horizon benchmarks including ALFWorld and WebShop, ReBel improves task success rate by up to 20.4 percentage points over the episode-level GRPO baseline and increases sample efficiency by 2.1×.
📝 Abstract
Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address these challenges, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
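The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the function names, the dictionary representation of a belief, the exact-match scoring rule, and the use of a discrete belief key for grouping are all assumptions made for illustration. The first function scores how well a predicted belief agrees with what the environment later reveals; the second computes GRPO-style normalized advantages within groups of trajectories that share a belief key rather than across the whole batch.

```python
from collections import defaultdict
from statistics import mean, pstdev


def belief_consistency_reward(predicted: dict, observed: dict) -> float:
    """Hypothetical dense signal: fraction of predicted belief entries
    that the observed feedback confirms.

    Only keys that the feedback actually reveals are scored. Returns 0.0
    when none of the predicted keys were observed.
    """
    checked = [k for k in predicted if k in observed]
    if not checked:
        return 0.0
    return sum(predicted[k] == observed[k] for k in checked) / len(checked)


def belief_grouped_advantages(returns, belief_keys, eps=1e-8):
    """Hypothetical belief-aware grouping: normalize each return against
    the mean and (population) standard deviation of its own belief group.

    `belief_keys[i]` is an assumed hashable summary of the belief state
    associated with `returns[i]`.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(belief_keys):
        groups[key].append(idx)

    advantages = [0.0] * len(returns)
    for indices in groups.values():
        group_returns = [returns[i] for i in indices]
        mu = mean(group_returns)
        sigma = pstdev(group_returns)
        for i in indices:
            advantages[i] = (returns[i] - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    reward = belief_consistency_reward(
        {"key_in_drawer": True, "door_open": False},
        {"key_in_drawer": True, "door_open": True},
    )
    print(reward)  # one of two checked entries matches

    adv = belief_grouped_advantages(
        returns=[1.0, 0.0, 1.0, 1.0],
        belief_keys=["a", "a", "b", "b"],
    )
    print(adv)
```

In this toy setup, comparing trajectories only against others with the same belief key means that a group whose members all succeed contributes zero advantage, while a mixed group separates its successes from its failures. Grouping by an exact key is the simplest stand-in for "similar belief states"; a similarity-based grouping would need a distance measure that the abstract does not specify.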