🤖 AI Summary
This work addresses offline imitation learning from imperfect human demonstrations (noisy actions, suboptimal examples) by proposing Counterfactual Behavior Cloning (Counter-BC). Rather than mimicking the demonstrated behaviors exactly, Counter-BC generalizes behavior cloning: it expands the dataset with counterfactual actions close to those the human actually showed, then modifies the demonstrations within this expanded region during training to recover a simple, consistent policy that captures the teacher's underlying intent. Theoretical analysis proves that Counter-BC can extract the intended policy from imperfect data, multiple users, and teachers of varying skill levels. Experiments on simulated and real-world robotic platforms show that Counter-BC significantly outperforms existing baselines, reliably recovering consistent and generalizable policies from noisy, multi-user, and low-skill demonstrations without requiring online interaction or reward labels.
📝 Abstract
Learning from humans is challenging because people are imperfect teachers. When everyday humans show the robot a new task they want it to perform, they inevitably make errors (e.g., inputting noisy actions) and provide suboptimal examples (e.g., overshooting the goal). Existing methods learn by mimicking the exact behaviors the human teacher provides -- but this approach is fundamentally limited because the demonstrations themselves are imperfect. In this work, we advance offline imitation learning by enabling robots to extrapolate what the human teacher meant, instead of only considering what the human actually showed. We achieve this by hypothesizing that all of the human's demonstrations are trying to convey a single, consistent policy, while the noise and suboptimality within their behaviors obfuscate the data and introduce unintentional complexity. To recover the underlying policy and learn what the human teacher meant, we introduce Counter-BC, a generalized version of behavior cloning. Counter-BC expands the given dataset to include actions close to the behaviors the human demonstrated (i.e., counterfactual actions that the human teacher could have intended, but did not actually show). During training, Counter-BC autonomously modifies the human's demonstrations within this expanded region to reach a simple and consistent policy that explains the underlying trends in the human's dataset. Theoretically, we prove that Counter-BC can extract the desired policy from imperfect data, multiple users, and teachers of varying skill levels. Empirically, we compare Counter-BC to state-of-the-art alternatives in simulated and real-world settings with noisy demonstrations, standardized datasets, and real human teachers. See videos of our work here: https://youtu.be/XaeOZWhTt68
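To make the core idea concrete, here is a minimal toy sketch (not the paper's implementation) of the two ingredients the abstract describes: a counterfactual region of radius `eps` around each demonstrated action, and an alternating loop that fits a simple policy and then nudges each action toward that policy while staying inside the region. The 1-D setup, the linear policy class, and all parameter values (`w_true`, `eps`, the noise scale) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D imitation problem (illustrative, not from the paper):
# the teacher *intends* the policy a = w_true * s, but the
# demonstrated actions are corrupted by noise.
w_true = 2.0
states = rng.uniform(-1.0, 1.0, size=200)
demo_actions = w_true * states + rng.normal(0.0, 0.3, size=200)

eps = 0.5                       # radius of the counterfactual region
actions = demo_actions.copy()   # counterfactual actions, start at the demos

for _ in range(50):
    # Step 1: fit a simple policy (least-squares line a = w * s)
    # to the current, possibly modified, actions.
    w = np.dot(states, actions) / np.dot(states, states)
    # Step 2: move each action toward the policy's prediction, but
    # never farther than eps from what the human actually showed.
    preds = w * states
    actions = demo_actions + np.clip(preds - demo_actions, -eps, eps)

print(f"recovered w = {w:.3f} (intended w = {w_true})")
```

In this sketch, the recovered slope lands close to the intended one because the modified actions are pulled toward a single consistent policy, while the `eps`-ball constraint keeps them anchored to what the human demonstrated.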