🤖 AI Summary
This work addresses the challenges of modeling state uncertainty and unbounded context growth in long-horizon partially observable tasks for large language models. The authors propose a dual-model cooperative architecture: a belief-state model explicitly represents environmental beliefs using structured, learnable natural language atomic propositions, each augmented with a trainable confidence score; a policy model then makes decisions based on this compact belief representation. This approach is the first to explicitly model both beliefs and their uncertainties in natural language form, effectively decoupling context length from task duration. Experiments demonstrate that the method significantly outperforms strong baselines—by +14.5% over Qwen2.5-3B-Instruct and +5.3% over Qwen3-4B-Instruct—across multiple long-horizon partially observable environments, maintains a nearly constant context window, and progressively calibrates beliefs as evidence accumulates.
📝 Abstract
Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.