Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

๐Ÿ“… 2024-12-15
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address inaccurate credit assignment caused by sparse, delayed episodic rewards in reinforcement learning, this paper proposes LaRe, an LLM-empowered symbolic decision-making framework. Methodologically, LaRe introduces the "Latent Reward", a multi-dimensional implicit reward that tightly couples LLMs' semantic knowledge with symbolic execution to enable fine-grained, interpretable performance evaluation and reward redistribution. It further incorporates a self-verification mechanism to mitigate LLM hallucination, and it theoretically establishes that eliminating reward-irrelevant redundancy improves reward estimation accuracy. Empirically, LaRe achieves significant gains over state-of-the-art methods on temporal credit assignment and multi-agent contribution decomposition tasks; notably, in several scenarios it even outperforms policies trained with ground-truth rewards.
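The self-verification step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `self_verify`, `candidates`, `probe_states`, and `n_dims` are all hypothetical names, and the checks (the function runs without error and returns the expected number of latent-reward dimensions on probe states) are assumed stand-ins for whatever validation the paper actually performs on LLM-generated code.

```python
import random

def self_verify(candidates, probe_states, n_dims):
    """Hypothetical self-verification filter: keep only candidate
    latent-reward functions (e.g. sampled from an LLM) that execute
    without error and return the expected number of dimensions on a
    set of probe states, then pick one of the survivors."""
    valid = []
    for fn in candidates:
        try:
            if all(len(fn(s)) == n_dims for s in probe_states):
                valid.append(fn)
        except Exception:
            continue  # discard candidates that crash on probe inputs
    return random.choice(valid) if valid else None
```

Running several sampled candidates through such a filter is one plausible way to reduce the effect of randomness and hallucination in LLM inference: malformed generations are rejected before they can corrupt the reward signal.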

๐Ÿ“ Abstract
Reinforcement learning (RL) often encounters delayed and sparse feedback in real-world applications, even with only episodic rewards. Previous approaches have made some progress in reward redistribution for credit assignment but still face challenges, including training difficulty caused by redundancy and ambiguous attribution that stems from overlooking the multifaceted nature of mission performance evaluation. Fortunately, large language models (LLMs) encompass rich decision-making knowledge and provide a plausible tool for reward redistribution. Even so, deploying an LLM here is non-trivial because of the misalignment between linguistic knowledge and the required symbolic form, together with the inherent randomness and hallucinations of LLM inference. To tackle these issues, we introduce LaRe, a novel LLM-empowered, symbolic-based decision-making framework, to improve credit assignment. Key to LaRe is the concept of the Latent Reward, which serves as a multi-dimensional performance evaluation, enabling more interpretable assessment of goal attainment from various perspectives and facilitating more effective reward redistribution. We show that code semantically generated by an LLM can bridge linguistic knowledge and symbolic latent rewards, since it is executable on symbolic objects. Meanwhile, we design latent reward self-verification to increase the stability and reliability of LLM inference. Theoretically, eliminating reward-irrelevant redundancy in the latent reward benefits RL performance through more accurate reward estimation. Extensive experimental results demonstrate that LaRe (i) achieves superior temporal credit assignment over SOTA methods, (ii) excels at allocating contributions among multiple agents, and (iii) outperforms policies trained with ground-truth rewards on certain tasks.
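The reward-redistribution idea in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: `latent_reward` is a hand-written stand-in for LLM-generated evaluation code, the state tuple and its three dimensions (progress, safety, efficiency) are invented for the example, and the proportional-weighting rule is one simple way to spread a single episodic return over per-step proxy rewards.

```python
import numpy as np

def latent_reward(state):
    """Stand-in for LLM-generated evaluation code: scores a symbolic
    state along several semantic dimensions (all names hypothetical)."""
    progress, safety, efficiency = state
    return np.array([progress, safety, efficiency])

def redistribute(states, episodic_return):
    """Spread one episodic return over per-step proxy rewards in
    proportion to each step's aggregated latent reward, so that the
    proxy rewards sum back to the episodic return."""
    scores = np.array([latent_reward(s).mean() for s in states])
    weights = scores / scores.sum()  # assumes non-negative scores
    return weights * episodic_return

states = [(0.1, 0.9, 0.5), (0.4, 0.8, 0.6), (0.9, 0.7, 0.8)]
step_rewards = redistribute(states, episodic_return=10.0)
```

Because the per-step rewards sum exactly to the episodic return, a standard RL algorithm can train on the dense proxy signal without changing the total reward the episode delivers.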
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Reward Allocation
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-assisted Decision Framework
Reward Optimization
Multi-perspective Evaluation
๐Ÿ”Ž Similar Papers
No similar papers found.