🤖 AI Summary
Existing imitation learning approaches exhibit poor generalization when deploying robotic policies across diverse environments, primarily due to spurious environmental confounders—irrelevant visual or sensory elements—in the observations, which distort causal modeling of task-relevant dynamics. To address this, we propose a lightweight causal structure learning framework that directly identifies and models causal dependencies between observable components and expert actions from raw observations, without requiring explicit image feature disentanglement. Our method employs intervention-based policy learning to estimate the causal structure function and integrates seamlessly with modern architectures such as Action Chunking Transformer. Evaluated in the ALOHA bimanual robot MuJoCo simulation environment, our approach significantly improves robustness of action prediction under domain shift: it achieves an average 23.6% gain in generalization performance across multiple environmental perturbations, including lighting changes, background clutter, and camera viewpoint shifts.
📝 Abstract
Recent developments in imitation learning have considerably advanced robotic manipulation. However, current imitation learning techniques can suffer from poor generalization, limiting performance even under relatively minor domain shifts. In this work, we aim to enhance the generalization capabilities of complex imitation learning algorithms so they can handle unpredictable changes between training environments and deployment environments. To avoid confusion caused by observations that are irrelevant to the target task, we propose to explicitly learn the causal relationship between observation components and expert actions, employing a framework similar to [6], in which a causal structural function is learned by intervening on the imitation learning policy. Because disentangling feature representations from image inputs, as required in [6], is hard to satisfy in complex imitation learning for robotic manipulation, we theoretically clarify that this requirement is not necessary for learning the causal relationships. We therefore propose a simple causal structure learning framework that can be easily embedded in recent imitation learning architectures, such as the Action Chunking Transformer [31]. We demonstrate our approach on a simulation of the ALOHA [31] bimanual robot arms in MuJoCo, and show that the method can considerably mitigate the generalization problem of existing complex imitation learning algorithms.
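The intervention idea described above can be illustrated with a minimal sketch: resample one observation component at a time (a do-style intervention) and score how strongly each component influences the policy's predicted action, keeping only high-impact components as causal parents. Everything here is a hypothetical stand-in — the synthetic data, the linear least-squares "policy", and the threshold are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demo: 5 observation components; the expert action depends
# only on components 0 and 2. Components 1, 3, 4 play the role of
# spurious environmental confounders.
N, D = 2000, 5
obs = rng.normal(size=(N, D))
actions = 1.5 * obs[:, 0] - 2.0 * obs[:, 2]

# A linear least-squares fit stands in for a trained imitation policy.
w, *_ = np.linalg.lstsq(obs, actions, rcond=None)

def policy(o):
    return o @ w

def causal_scores(obs, policy, rng):
    """Score each component by how much intervening on it shifts actions."""
    base = policy(obs)
    scores = np.empty(obs.shape[1])
    for j in range(obs.shape[1]):
        intervened = obs.copy()
        # Permuting one column breaks its dependence on the rest of the
        # observation while preserving its marginal distribution.
        intervened[:, j] = rng.permutation(intervened[:, j])
        scores[j] = np.mean((policy(intervened) - base) ** 2)
    return scores

scores = causal_scores(obs, policy, rng)
# Keep components whose intervention effect is a sizable fraction of the max.
causal_mask = scores > 0.1 * scores.max()
print(causal_mask)
```

In this toy setup the mask retains components 0 and 2 and drops the spurious ones; in the paper's setting the masked observation components would then be the only inputs the downstream policy conditions on.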