Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework

📅 2024-09-07
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the exploration-exploitation imbalance and low sample efficiency of sparse-reward reinforcement learning, this paper proposes LMGT: a framework that leverages a large language model (LLaMA-3) as an interpretable, non-parametric reward shaper. LMGT dynamically refines reward signals using the LLM's embedded prior knowledge—e.g., from Wikipedia-style tutorials—without modifying the environment or designing handcrafted reward functions. Combining prompt engineering with reward tuning, LMGT is algorithm-agnostic and compatible with mainstream RL methods such as PPO and SAC. The authors evaluate it on the Housekeep embodied robotics simulation and multiple standard RL benchmarks. Experimental results indicate that LMGT significantly improves sample efficiency, reducing required training samples by 42% and computational overhead by 35%. By moving beyond traditional reward engineering and inverse reinforcement learning, LMGT establishes a paradigm for knowledge-guided, efficient reinforcement learning.

📝 Abstract
The inherent uncertainty in the environmental transition model of Reinforcement Learning (RL) necessitates a delicate balance between exploration and exploitation. This balance is crucial for optimizing computational resources to accurately estimate expected rewards for the agent. In scenarios with sparse rewards, such as robotic control systems, achieving this balance is particularly challenging. However, given that many environments possess extensive prior knowledge, learning from the ground up in such contexts may be redundant. To address this issue, we propose Language Model Guided reward Tuning (LMGT), a novel, sample-efficient framework. LMGT leverages the comprehensive prior knowledge embedded in Large Language Models (LLMs) and their proficiency in processing non-standard data forms, such as wiki tutorials. By utilizing LLM-guided reward shifts, LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior and enhancing sample efficiency. We have rigorously evaluated LMGT across various RL tasks and in the embodied robotic environment Housekeep. Our results demonstrate that LMGT consistently outperforms baseline methods. Furthermore, the findings suggest that our framework can substantially reduce the computational resources required during the RL training phase.
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and exploitation in RL under environmental uncertainty
Improving sample efficiency in sparse-reward settings such as robotic control
Leveraging the prior knowledge of LLMs to guide reward tuning in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs for reward guidance
Balances exploration and exploitation efficiently
Reduces computational resources in RL
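The core idea above — an LLM rates the agent's actions and its rating is added to the environment reward as a shift, leaving the environment and the RL algorithm untouched — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score_action` is a hypothetical stand-in for the LLM evaluator (which in LMGT would be prompted with textual context such as tutorials), and the toy heuristic and `weight` parameter are assumptions for demonstration.

```python
def score_action(observation, action):
    """Stand-in for an LLM prompt that rates an action given context.

    Returns a reward shift in {-1, +1}. Here a toy heuristic plays the
    LLM's role: the assumed prior knowledge is that the goal sits at
    position 10, so moving toward it is rated favorably.
    """
    toward_goal = (action == +1 and observation < 10) or \
                  (action == -1 and observation > 10)
    return 1 if toward_goal else -1


def shaped_reward(env_reward, observation, action, weight=0.5):
    """Combine the raw environment reward with the LLM-suggested shift.

    The environment reward is never replaced, only shifted, so the
    scheme stays compatible with any value-based or policy-gradient
    RL algorithm (e.g. PPO, SAC).
    """
    return env_reward + weight * score_action(observation, action)
```

For example, with a sparse environment reward of 0.0 at position 5, the shift alone differentiates the two actions: `shaped_reward(0.0, 5, +1)` yields 0.5 while `shaped_reward(0.0, 5, -1)` yields -0.5, steering exploration toward the goal before any environment reward is ever observed.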
Yongxin Deng
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
Xihe Qiu
Associate Professor, Shanghai University of Engineering Science
AI for Healthcare · Vision-Language Models · Reinforcement Learning · Large Language Models
Jue Chen
Xiaoyu Tan
INF Technology (Shanghai) Co., Ltd., Shanghai, China