🤖 AI Summary
This work addresses two obstacles to reinforcement learning in sparse- or delayed-reward environments: the sample inefficiency that arises when no prior structure is available to guide exploration, and the scalability and reliability costs of relying on frequent large language model (LLM) queries. To overcome these limitations, the authors propose MIRA, a method that constructs a structured memory graph integrating the agent's high-return experiences with LLM-derived knowledge. This memory graph provides early training guidance and yields a decaying soft utility signal that refines advantage estimation, improving policy updates without altering the original reward function. Crucially, MIRA distills LLM priors into a persistent memory, eliminating the need for frequent online LLM queries. Experimental results show that MIRA significantly outperforms standard RL baselines on sparse-reward tasks and matches the performance of approaches that use frequent LLM supervision, while drastically reducing the number of LLM invocations.
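The summary does not specify how the memory graph is represented. The following is a minimal illustrative sketch, not the authors' implementation: it assumes nodes keyed by subgoal labels or abstracted states, tagged by their source (LLM prior vs. agent experience), with edges recording observed transitions and a running return estimate serving as the utility lookup.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    key: str                      # e.g. a subgoal label or a hashed state abstraction (assumed)
    source: str                   # "llm_prior" or "agent_experience" (assumed tags)
    mean_return: float = 0.0      # running mean of returns observed when this node is visited
    visits: int = 0
    edges: dict = field(default_factory=dict)   # successor key -> transition count

class MemoryGraph:
    """Stores trajectory segments and subgoal chains as a directed graph (sketch)."""

    def __init__(self):
        self.nodes = {}

    def add_segment(self, keys, episode_return, source):
        """Insert an ordered segment (agent trajectory or LLM-suggested subgoal chain)."""
        for k in keys:
            node = self.nodes.setdefault(k, MemoryNode(key=k, source=source))
            node.visits += 1
            node.mean_return += (episode_return - node.mean_return) / node.visits
        for a, b in zip(keys, keys[1:]):
            self.nodes[a].edges[b] = self.nodes[a].edges.get(b, 0) + 1

    def utility(self, key):
        """Soft utility for states/subgoals present in memory; zero otherwise."""
        node = self.nodes.get(key)
        return node.mean_return if node is not None else 0.0
```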
📝 Abstract
Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/
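As an illustration of the utility-based shaping described in the abstract, here is a minimal sketch under assumed details: the memory-derived utility enters the advantage additively and its weight decays exponentially with the training step (the paper's exact shaping rule and decay schedule may differ).

```python
import math

def shaped_advantage(advantage, utility, step, decay_rate=1e-4):
    """Softly adjust an advantage estimate with a memory-derived utility signal.

    The shaping weight decays toward zero, so early updates are guided by the
    memory graph while later updates revert to the unmodified advantage and the
    original reward function (the exponential schedule here is an assumption).
    """
    lam = math.exp(-decay_rate * step)
    return advantage + lam * utility

# Example: early in training the utility term dominates guidance; later it vanishes.
print(shaped_advantage(advantage=0.1, utility=0.5, step=0))        # 0.6
print(shaped_advantage(advantage=0.1, utility=0.5, step=100_000))  # ~0.1
```

In a PPO-style update, such a shaped advantage would replace the estimator fed to the policy loss while the environment reward and value targets stay unchanged, consistent with the claim that the underlying reward function is not modified.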