Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep reinforcement learning (DRL) agents often fail when encountering out-of-distribution (OOD) states; existing approaches either focus on OOD avoidance or rely on uncertainty estimation for recovery, and both scale poorly. This paper introduces a novel, uncertainty-agnostic post-OOD recovery paradigm: leveraging large vision-language models (LVLMs) to generate dense reward code that guides the policy back to in-distribution, feasible states. The method integrates LVLM capabilities in visual perception, logical reasoning, and code generation with dense reward encoding, policy fine-tuning, and behavior cloning. Evaluated on bio-inspired locomotion tasks, it significantly improves both recovery success rate and efficiency. Notably, it achieves, for the first time, cross-task generalization in challenging domains, including humanoid locomotion and mobile manipulation, demonstrating robustness beyond narrow task-specific recovery.

📝 Abstract
Deep Reinforcement Learning (DRL) has demonstrated strong performance in robotic control but remains susceptible to out-of-distribution (OOD) states, often resulting in unreliable actions and task failure. While previous methods have focused on minimizing or preventing OOD occurrences, they largely neglect recovery once an agent encounters such states. Although the latest research has attempted to address this by guiding agents back to in-distribution states, their reliance on uncertainty estimation hinders scalability in complex environments. To overcome this limitation, we introduce Language Models for Out-of-Distribution Recovery (LaMOuR), which enables recovery learning without relying on uncertainty estimation. LaMOuR generates dense reward codes that guide the agent back to a state where it can successfully perform its original task, leveraging the capabilities of LVLMs in image description, logical reasoning, and code generation. Experimental results show that LaMOuR substantially enhances recovery efficiency across diverse locomotion tasks and even generalizes effectively to complex environments, including humanoid locomotion and mobile manipulation, where existing methods struggle. The code and supplementary materials are available at https://lamour-rl.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Addresses unreliable actions in DRL from OOD states
Enables recovery learning without uncertainty estimation
Improves recovery efficiency in complex robotic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large vision-language models (LVLMs) for OOD recovery
Generates dense reward code that guides the agent back to in-distribution states
Leverages LVLMs for image description, logical reasoning, and code generation
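To make the "dense reward code" idea concrete: the paper describes the LVLM emitting executable reward code that scores how close the agent is to a recoverable, in-distribution state. The sketch below is a hypothetical example of that kind of generated code for a fallen humanoid; the state features, target values, and weights are illustrative assumptions, not the paper's actual output.

```python
import math

# Hypothetical LVLM-generated dense recovery reward for a fallen humanoid.
# Reward rises smoothly as the torso returns to an upright, in-distribution
# pose. TARGET_HEIGHT and the velocity penalty scale are assumed values.
TARGET_HEIGHT = 1.2  # nominal standing torso height in meters (assumption)

def recovery_reward(torso_height, torso_up_cos, joint_vel_norm):
    """Dense reward in [0, 1] guiding the agent back toward feasible states.

    torso_height:   current torso height above the ground (m)
    torso_up_cos:   cosine of the torso tilt angle (1.0 = fully upright)
    joint_vel_norm: norm of the joint velocity vector (discourages flailing)
    """
    height_term = math.exp(-abs(torso_height - TARGET_HEIGHT))  # peaks at target height
    upright_term = max(0.0, torso_up_cos)                       # zero when sideways or inverted
    smooth_term = math.exp(-0.1 * joint_vel_norm)               # prefer controlled motion
    return height_term * upright_term * smooth_term
```

Because the signal is dense (every state gets a graded score rather than a sparse success flag), it can directly shape fine-tuning of the recovery policy; standing upright and still yields a reward of 1.0, while lying flat yields roughly 0.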