🤖 AI Summary
Existing visual imitation learning methods merely replicate human demonstrations without modeling the underlying causal decision-making mechanisms, resulting in poor generalization and failure to transfer across environments.
Method: We propose a causality-driven imitation learning framework that explicitly incorporates human causal intuition—via intuitive annotations of task-critical causal elements (visual highlights + natural language prompts)—to guide robots in identifying causal visual features. Our approach integrates causal feature filtering, multimodal alignment, and a Transformer-based policy network, supported by a human-robot collaborative annotation interface.
Contribution/Results: This is the first work to explicitly embed human causal reasoning into the imitation learning pipeline, shifting the paradigm from “behavioral cloning” to “causal policy learning.” In both simulation and real-robot experiments, our method surpasses state-of-the-art approaches using fewer demonstrations and achieves significantly higher task success rates in unseen scenarios. A user study confirms its improved interpretability and usability.
📝 Abstract
Today's robots learn new tasks by imitating human examples. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding the features that factor into the human's decisions, robot learners often misinterpret the data and fail to perform the task when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to indicate task-relevant features using markers and language prompts. Our proposed algorithm, CIVIL, leverages this augmented data to filter the robot's visual observations and extract a feature representation that causally informs human actions. CIVIL then applies these causal features to train a transformer-based policy that emulates human behaviors without being confused by visual distractors. Our simulations, real-world experiments, and user study demonstrate that robots trained with CIVIL can learn from fewer human demonstrations and perform better than state-of-the-art baselines, especially in previously unseen scenarios. See videos at our project website: https://civil2025.github.io