🤖 AI Summary
This paper addresses the zero-shot compositional generalization challenge in reinforcement learning—specifically, the failure of value estimation when encountering unseen state compositions formed from known primitive elements (e.g., pedestrians, vehicles). We formally define this task and propose a novel behavior cloning paradigm based on conditional diffusion models. Unlike conventional value-based methods, our approach leverages expert trajectories as supervision to directly generate policy actions tailored to unseen compositional states. Extensive experiments across three heterogeneous domains—maze navigation, autonomous driving, and multi-agent coordination—demonstrate that our method significantly outperforms mainstream RL algorithms. It exhibits strong cross-compositional generalization and domain transferability, enabling effective decision-making under open-world conditions. This work establishes a scalable, generative modeling framework for compositional generalization in sequential decision-making, advancing robustness and adaptability in real-world RL applications.
📝 Abstract
Many real-world decision-making problems are combinatorial in nature, where states (e.g., the surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to this combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone; it requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on expert trajectories generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multi-agent environments, we show that conditioned diffusion models outperform traditional RL techniques, and we highlight the broad applicability of our problem formulation.
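To make the core idea concrete, the following is a minimal sketch of behavior cloning with a conditioned diffusion model: the model is trained to denoise expert actions conditioned on the state, and at test time it generates an action for a (possibly unseen) state by running the reverse diffusion chain. All dimensions, the linear noise predictor, and the noise schedule are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                  # diffusion steps (assumed)
betas = np.linspace(1e-4, 0.05, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

STATE_DIM, ACTION_DIM = 4, 2            # toy dimensions, not from the paper

def eps_model(noisy_action, state, t, W):
    """Toy linear noise predictor conditioned on the state:
    eps_hat = W @ [noisy_action; state; t/T]. A real model would be a
    neural network; the conditioning on `state` is the key ingredient."""
    feats = np.concatenate([noisy_action, state, [t / T]])
    return W @ feats

def training_loss(W, state, expert_action):
    """Standard DDPM-style objective: noise an expert action to a random
    timestep and ask the model to predict the injected noise."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(ACTION_DIM)
    noisy = (np.sqrt(alpha_bars[t]) * expert_action
             + np.sqrt(1.0 - alpha_bars[t]) * eps)
    return np.mean((eps_model(noisy, state, t, W) - eps) ** 2)

def sample_action(W, state):
    """Reverse diffusion: start from Gaussian noise and iteratively denoise,
    conditioned on the state, to produce a policy action."""
    a = rng.standard_normal(ACTION_DIM)
    for t in reversed(range(T)):
        eps_hat = eps_model(a, state, t, W)
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
            / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            a += np.sqrt(betas[t]) * rng.standard_normal(ACTION_DIM)
    return a

# Untrained toy weights, one state, one expert action -- just to run the loop.
W = 0.01 * rng.standard_normal((ACTION_DIM, ACTION_DIM + STATE_DIM + 1))
state = rng.standard_normal(STATE_DIM)
loss = training_loss(W, state, expert_action=np.ones(ACTION_DIM))
action = sample_action(W, state)
```

Because the sampler always conditions on the current state, generalization to unseen compositional states hinges entirely on how well the noise predictor extrapolates across state combinations, which is the expressiveness argument the abstract makes against value-based methods.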