Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Conventional reinforcement learning often relies on expert demonstrations or explicit reward signals and struggles to learn efficiently without them. This work introduces the psychological mechanism of vicarious conditioning into deep reinforcement learning, formulating an intrinsic reward signal through a four-stage process (attention, retention, reproduction, and reinforcement) that requires neither the demonstrator's policy nor its reward function. Combining a memory-based architecture with off-policy learning, the method extends agent survival time in the MiniWorld Sidewalk and Box2D CarRacing environments by steering the agent away from non-descriptive terminal states (those that provide no reward upon agent death) and toward desirable states. The authors present the approach as supporting low-shot learning and as better suited to single-life and continual learning problems.

📝 Abstract

Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents' capacity by restricting how they learn from the environment as well as from others. Off-policy or learn-by-example methods can learn from demonstrators' representations, but they require access to the demonstrating agent's policies or reward functions. Our work overcomes this direct sampling limitation by introducing vicarious conditioning as an intrinsic reward mechanism. We draw from psychological and biological literature to provide a foundation for vicarious conditioning and use memory-based methods to implement its four steps: attention, retention, reproduction, and reinforcement. Crucially, our vicarious conditioning paradigms support low-shot learning and require neither the demonstrator agent's policy nor its reward functions. We evaluate our approach in the MiniWorld Sidewalk environment, one of the few public environments that features a non-descriptive terminal condition (no reward provided upon agent death), and extend it to Box2D's CarRacing environment. Our results across both environments demonstrate that vicarious conditioning yields longer episodes by steering the agent away from non-descriptive terminal conditions and toward desirable states. Overall, this work emulates a cognitively plausible learning paradigm better suited to problems such as single-life learning or continual learning.
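To make the idea of a memory-based intrinsic reward more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the architecture, so the class `VicariousMemory`, its `bandwidth` parameter, the Gaussian similarity, and the +1/-1 valence labels are all hypothetical choices made for illustration. The sketch covers only a simplified retention step (storing observed states with their outcomes) and a reinforcement step (turning similarity to stored states into a reward); attention and reproduction are assumed to happen elsewhere, for example in whatever selects states to store and in the off-policy learner that consumes the reward.

```python
import math
from dataclasses import dataclass, field


@dataclass
class VicariousMemory:
    """Hypothetical episodic memory of states observed in a demonstrator."""

    bandwidth: float = 1.0  # assumed similarity scale, not from the paper
    entries: list = field(default_factory=list)  # (state, valence) pairs

    def retain(self, state, valence):
        # Retention: store an attended state together with its observed
        # outcome (+1.0 for a desirable outcome, -1.0 for a terminal one).
        self.entries.append((tuple(state), float(valence)))

    def intrinsic_reward(self, state):
        # Reinforcement: return the valence of the most similar stored
        # state, scaled by a Gaussian similarity so that distant memories
        # contribute little. An empty memory yields no reward.
        best_sim, best_val = 0.0, 0.0
        for mem_state, valence in self.entries:
            sq_dist = sum((a - b) ** 2 for a, b in zip(state, mem_state))
            sim = math.exp(-sq_dist / (2.0 * self.bandwidth ** 2))
            if sim > best_sim:
                best_sim, best_val = sim, valence
        return best_sim * best_val


# Usage: one observed terminal state and one observed desirable state.
memory = VicariousMemory(bandwidth=1.0)
memory.retain([0.0, 0.0], -1.0)  # demonstrator died here
memory.retain([5.0, 5.0], +1.0)  # demonstrator did well here

r_near_hazard = memory.intrinsic_reward([0.1, 0.0])  # negative
r_near_goal = memory.intrinsic_reward([5.0, 4.9])    # positive
```

In this sketch the agent receives a negative signal near states where another agent was seen to terminate, even though the environment itself gives no reward on death, and no demonstrator policy or reward function is needed. A nearest-neighbour lookup is only one possible design; a learned embedding or a weighted sum over all memories would be equally plausible under the same assumptions.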

Problem

Research questions and friction points this paper is trying to address.

intrinsic reward

vicarious conditioning

reinforcement learning

imitation learning

low-shot learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

vicarious conditioning

intrinsic reward

imitation learning