🤖 AI Summary
In closed embodied environments, large language models often fall into a vicious cycle of invalid actions and state drift, because actions carry strict preconditions and failure feedback is sparse. This work proposes the RPMS architecture, which enforces action feasibility through structured rule retrieval, uses a lightweight belief-state filter to judge whether episodic memory applies to the current situation, and resolves conflicts between rules and memory with a rule-priority arbitration mechanism. The study shows, for the first time, that episodic memory helps only when it is filtered by the belief state and constrained by rules; under these conditions, rules and memory reinforce each other. On ALFWorld, Llama-3.1 8B gains 23.9 percentage points in single-run success rate, reaching 59.7% (rule retrieval alone contributes +14.9 pp); on ScienceWorld, GPT-4's average score rises from 44.9 to 54.0.
📝 Abstract
LLM agents often fail in closed-world embodied environments because actions must satisfy strict preconditions -- such as location, inventory, and container states -- and failure feedback is sparse. We identify two structurally coupled failure modes: (P1) invalid action generation and (P2) state drift, each amplifying the other in a degenerative cycle. We present RPMS, a conflict-managed architecture that enforces action feasibility via structured rule retrieval, gates memory applicability via a lightweight belief state, and resolves conflicts between the two sources via rules-first arbitration. On ALFWorld (134 unseen tasks), RPMS achieves 59.7% single-trial success with Llama 3.1 8B (+23.9 pp over baseline) and 98.5% with Claude Sonnet 4.5 (+11.9 pp); of the 8B gain, rule retrieval alone contributes +14.9 pp (statistically significant), making it the dominant factor. A key finding is that episodic memory is conditionally useful: it harms performance on some task types when used without grounding, but becomes a stable net positive once filtered by current state and constrained by explicit action rules. Adapting RPMS to ScienceWorld with GPT-4 yields consistent gains across all ablation conditions (avg. score 54.0 vs. 44.9 for the ReAct baseline), providing transfer evidence that the core mechanisms hold across structurally distinct environments.
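The abstract's three mechanisms (precondition rules, a belief-state gate on memory, rules-first arbitration) can be sketched as a small decision loop. This is a hypothetical illustration, not the authors' implementation: all names (`BeliefState`, `rule_feasible`, `arbitrate`) and the rule format are assumptions made for clarity.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of rules-first arbitration (names and data shapes are
# assumptions, not the paper's API). A candidate action is admitted only if
# every retrieved precondition rule holds against the current belief state;
# memory-suggested actions pass the same check, so rules always win conflicts.

@dataclass
class BeliefState:
    location: str
    inventory: set = field(default_factory=set)

def rule_feasible(action, state, rules):
    """True iff every precondition rule for this action's verb holds."""
    return all(rule(action, state) for rule in rules.get(action["verb"], []))

def arbitrate(proposed, memory_suggestion, state, rules):
    """Rules-first arbitration: prefer the memory-derived suggestion only
    when it is rule-feasible; otherwise fall back to the model's own
    proposal; reject both if neither passes the precondition check."""
    for candidate in (memory_suggestion, proposed):
        if candidate is not None and rule_feasible(candidate, state, rules):
            return candidate
    return None  # no feasible action; the agent should re-plan

# Toy precondition rules: "take" requires being at the object's location;
# "put" requires already holding the object.
rules = {
    "take": [lambda a, s: a["at"] == s.location],
    "put":  [lambda a, s: a["obj"] in s.inventory],
}

state = BeliefState(location="kitchen", inventory={"mug"})
stale_memory = {"verb": "take", "obj": "apple", "at": "garden"}  # drifted episode
proposal     = {"verb": "put",  "obj": "mug",   "at": "kitchen"}
chosen = arbitrate(proposal, stale_memory, state, rules)
print(chosen["verb"])  # rules reject the stale memory; "put" is chosen
```

The point of the sketch is the asymmetry the paper describes: memory is consulted first as a richer signal, but it can never override a violated precondition, which is what keeps ungrounded episodes from reinjecting state drift.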