🤖 AI Summary
This work addresses the difficulty of training self-evolving agents from self-sampled data under sparse rewards, where temporal credit assignment based on task-specific value functions is sample-inefficient and generalizes poorly. It proposes Retrospective In-Context Learning (RICL), which uses the pretrained knowledge of large language models (LLMs) to transform sparse rewards into dense advantage signals without learning a task-specific value function, and the Retrospective In-Context Online Learning (RICOL) framework, which iteratively refines the policy using RICL's credit assignment results. Experiments show that RICL estimates the advantage function accurately from limited samples and identifies critical states in the environment. On four BabyAI scenarios, RICOL reaches performance comparable to traditional online reinforcement learning algorithms with significantly higher sample efficiency.
📝 Abstract
Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Extended evaluation on four BabyAI scenarios shows that RICOL converges to performance comparable to that of traditional online RL algorithms while being significantly more sample-efficient. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.
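To make "transforming sparse rewards into dense training signals" concrete, the sketch below computes a conventional one-step temporal-difference advantage, A_t = r_t + γ·V(s_{t+1}) − V(s_t), for a single sparse-reward trajectory. This is not the RICL estimator, which the abstract does not specify. It only illustrates the kind of per-step signal that credit assignment produces. The function name `td_advantages` and the hand-written value estimates are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence


def td_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
) -> List[float]:
    """One-step TD advantages for a single finished episode.

    rewards[t] is the reward received after the action at step t.
    values[t] is an estimate of V(s_t); the state after the last
    step is treated as terminal with value 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = []
    for t in range(len(rewards)):
        # Bootstrap from the next state's value, or 0.0 at the terminal step.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(rewards[t] + gamma * next_value - values[t])
    return advantages


# Sparse feedback: only the final step is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
# Hypothetical value estimates (assumed, not learned): the state value
# jumps after step 1, e.g. because the agent picked up a required key.
values = [0.25, 0.25, 0.75, 0.75]

print(td_advantages(rewards, values, gamma=1.0))  # [0.0, 0.5, 0.0, 0.25]
```

Although the reward arrives only at the last step, the largest advantage falls on step 1, which marks it as the critical decision in this toy trajectory. In conventional RL the `values` would come from a learned, task-specific value function. The abstract's proposal is to obtain the advantage signal from an LLM through retrospective in-context learning instead.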