Good Actions Succeed, Bad Actions Generalize: A Case Study on Why RL Generalizes Better

📅 2025-03-19
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper investigates why reinforcement learning (RL) achieves better zero-shot generalization than supervised learning (SL) in visual navigation. Using the Habitat platform, we systematically compare behavior cloning (BC) and PPO across diverse unseen environments. PPO consistently outperforms BC in success rate and SPL; even when additional expert demonstrations allow BC to match PPO's SPL, BC still falls well short in success rate, indicating that imitation alone cannot replicate RL's generalization. We provide empirical evidence that RL's robust generalization stems from “experience stitching”: recombining fragments of many, mostly failed, trajectories to construct solutions for new tasks, rather than merely imitating successful demonstrations. Based on this insight, we propose two algorithmic design principles for improved generalization: (i) incentivizing exploratory failures to enrich trajectory diversity, and (ii) enabling cross-task reuse and recomposition of behavioral primitives. These findings offer empirical grounding and practical guidelines for developing general-purpose embodied AI agents.
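
To make the experience-stitching claim concrete, here is a minimal, hypothetical illustration (not the paper's Habitat/PPO setup): tabular Q-learning on a tiny chain environment, where two failed trajectory fragments overlap at a shared state and TD bootstrapping composes a complete route that neither fragment demonstrated. The environment, fragments, and hyperparameters are all invented for illustration.

```python
# A minimal, hypothetical illustration of "experience stitching" with tabular
# TD learning (Q-learning). This is NOT the paper's Habitat/PPO pipeline; it
# only shows how bootstrapped values compose a full route out of two failed,
# overlapping trajectory fragments.

# Chain environment: states 0..4, goal at state 4; actions: 0 = left, 1 = right.
N_STATES, GOAL, ACTIONS = 5, 4, (0, 1)
GAMMA, ALPHA = 0.9, 0.5

def step(state, action):
    """Deterministic chain dynamics; reward 1 only when the goal is reached."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

# Two "failed" fragments: A wanders near the start and never sees the goal;
# B starts mid-corridor and happens to end at the goal. Neither demonstrates
# the full route 0 -> 4 on its own.
fragments = [
    [(0, 1), (1, 1), (2, 0), (1, 0)],   # fragment A
    [(2, 1), (3, 1)],                   # fragment B
]

for _ in range(50):                      # replay the stored fragments repeatedly
    for fragment in fragments:
        for s, a in fragment:
            s_next, r = step(s, a)
            target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Greedy rollout from state 0: the learned values "stitch" the fragments into
# a complete path even though no training trajectory went from 0 to the goal.
s, path = 0, [0]
while s != GOAL and len(path) < 10:
    a = max(ACTIONS, key=lambda b: Q[(s, b)])
    s, _ = step(s, a)
    path.append(s)
print("greedy path:", path)   # expected: [0, 1, 2, 3, 4]
```

A pure imitator has nothing to copy here, since no single fragment reaches the goal from the start state; the TD learner still recovers the full path by propagating value backward through the shared state.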

📝 Abstract
Supervised learning (SL) and reinforcement learning (RL) are both widely used to train general-purpose agents for complex tasks, yet their generalization capabilities and underlying mechanisms are not yet fully understood. In this paper, we provide a direct comparison between SL and RL in terms of zero-shot generalization. Using the Habitat visual navigation task as a testbed, we evaluate Proximal Policy Optimization (PPO) and Behavior Cloning (BC) agents across two levels of generalization: state-goal pair generalization within seen environments and generalization to unseen environments. Our experiments show that PPO consistently outperforms BC across both zero-shot settings and performance metrics: success rate and SPL. Interestingly, even though additional optimal training data enables BC to match PPO's zero-shot performance in SPL, it still falls significantly behind in success rate. We attribute this to a fundamental difference in how models trained by these algorithms generalize: BC-trained models generalize by imitating successful trajectories, whereas TD-based RL-trained models generalize through combinatorial experience stitching, leveraging fragments of past trajectories (mostly failed ones) to construct solutions for new tasks. This allows RL to efficiently find solutions in a vast state space and discover novel strategies beyond the scope of human knowledge. Besides providing empirical evidence and understanding, we also propose practical guidelines for improving the generalization capabilities of RL and SL through algorithm design.
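
For readers unfamiliar with the second metric, SPL (Success weighted by Path Length) is the standard embodied-navigation efficiency metric; its usual definition, stated here for clarity rather than quoted from this paper, is:

```latex
% SPL over N evaluation episodes:
%   S_i = 1 if episode i succeeded, else 0
%   l_i = shortest-path (geodesic) distance from start to goal
%   p_i = length of the path the agent actually traveled
\mathrm{SPL} \;=\; \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{l_i}{\max\left(p_i,\; l_i\right)}
```

Success rate ignores path efficiency; SPL discounts each success by how much longer the agent's path was than the shortest one, which is why the two metrics can diverge.
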
Problem

Research questions and friction points this paper is trying to address.

Compare the zero-shot generalization of SL- and RL-trained agents on the same task.
Evaluate PPO and BC agents in both seen and unseen environments (a minimal BC training sketch follows this list).
Explain why RL generalizes better, via experience stitching.
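
As a point of reference for the BC baseline above, the sketch below shows the core of a behavior-cloning update: plain supervised cross-entropy on expert (observation, action) pairs. The observation dimensionality, action set, and network are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical discrete navigation actions (e.g., STOP/FORWARD/LEFT/RIGHT) and
# a stand-in observation encoder; both are illustrative, not the paper's setup.
N_ACTIONS, OBS_DIM = 4, 512

policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def bc_update(obs_batch, expert_actions):
    """One behavior-cloning step: match the expert's action at each observation."""
    logits = policy(obs_batch)                 # (B, N_ACTIONS)
    loss = loss_fn(logits, expert_actions)     # supervised cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch; in the paper's setting these would come from expert demonstrations.
obs = torch.randn(32, OBS_DIM)
actions = torch.randint(0, N_ACTIONS, (32,))
print(bc_update(obs, actions))
```

Everything such a policy learns comes from successful demonstrations, which is exactly why, per the paper's argument, BC has no mechanism for stitching together failed experience.
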
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proximal Policy Optimization outperforms Behavior Cloning across both zero-shot settings (a sketch of PPO's clipped objective follows this list)
RL generalizes via combinatorial experience stitching of trajectory fragments, mostly from failed attempts
RL discovers novel strategies beyond the scope of human knowledge
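
For contrast with the BC update above, here is a minimal sketch of the clipped surrogate loss that standard PPO optimizes (not any paper-specific variant); advantage estimation, the value loss, and the entropy bonus are omitted.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss from PPO.

    logp_new / logp_old: log-probabilities of the actions actually taken, under
    the current and the data-collecting policy; advantages: estimated A_t.
    Returns a scalar to minimize (the negated clipped objective).
    """
    ratio = torch.exp(logp_new - logp_old)                                  # r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for a rollout batch.
B = 8
logp_new = torch.randn(B, requires_grad=True)
loss = ppo_clip_loss(logp_new, torch.randn(B), torch.randn(B))
loss.backward()
print(loss.item())
```

The clipping keeps each update close to the data-collecting policy, which is what lets PPO reuse on-policy rollouts, including failed ones, for several gradient epochs without destabilizing training.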