AI Summary
This work addresses two challenges: the low sample efficiency of reinforcement learning in environments with sparse or delayed rewards, and the scalability and reliability issues caused by frequent large language model (LLM) queries during exploration. The authors propose a memory-graph-based hybrid reinforcement learning framework that combines LLM-generated subgoals with the agent's own successful trajectories to construct a memory graph. From this graph, a utility function is derived that shapes the advantage function, giving the critic auxiliary guidance without altering the original reward signal. By adopting an offline-dominant, online-occasional querying scheme, the method substantially reduces reliance on continuous LLM supervision. Experimental results show higher sample efficiency and faster early-stage learning across multiple benchmark tasks, with final performance comparable to methods that require frequent LLM invocations.
Abstract
In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent's trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input, issuing only occasional online queries, and thus avoids dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods that require frequent LLM interaction.
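To make the mechanism concrete, here is a minimal sketch of the idea described above: a memory graph built from successful subgoal sequences, an edge-frequency utility, and an advantage shifted by that utility while the environment reward is left untouched. All names (`MemoryGraph`, `shaped_advantage`, the weight `beta`) and the specific frequency-based utility are illustrative assumptions, not the paper's actual formulation.

```python
from collections import defaultdict

class MemoryGraph:
    """Directed graph over subgoals. Edges count how often a transition
    appeared in successful trajectories (LLM-suggested or the agent's own)."""

    def __init__(self):
        self.counts = defaultdict(int)      # (subgoal_u, subgoal_v) -> success count
        self.out_total = defaultdict(int)   # subgoal_u -> total outgoing successes

    def add_trajectory(self, subgoals):
        # Record consecutive subgoal pairs from one successful rollout.
        for u, v in zip(subgoals, subgoals[1:]):
            self.counts[(u, v)] += 1
            self.out_total[u] += 1

    def utility(self, u, v):
        # Fraction of past successes leaving u that took the edge (u, v).
        return self.counts[(u, v)] / self.out_total[u] if self.out_total[u] else 0.0

def shaped_advantage(advantage, utility, beta=0.5):
    # Auxiliary guidance for the critic: shift the advantage estimate by the
    # graph utility; the reward signal itself is never modified.
    return advantage + beta * utility

g = MemoryGraph()
g.add_trajectory(["start", "get_key", "open_door", "goal"])
g.add_trajectory(["start", "get_key", "goal"])
print(g.utility("get_key", "open_door"))                      # 0.5
print(shaped_advantage(1.0, g.utility("get_key", "open_door")))  # 1.25
```

One design point this sketch reflects: because shaping enters through the advantage rather than the reward, the optimal policy under the true reward is not redefined, only the critic's learning signal is biased toward previously successful subgoal transitions.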