CIVIL: Causal and Intuitive Visual Imitation Learning

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual imitation learning methods merely replicate human demonstrations without modeling the underlying causal decision-making mechanisms, resulting in poor generalization and failure to transfer across environments. Method: We propose a causality-driven imitation learning framework that explicitly incorporates human causal intuition—via intuitive annotations of task-critical causal elements (visual highlights + natural language prompts)—to guide robots in identifying causal visual features. Our approach integrates causal feature filtering, multimodal alignment, and a Transformer-based policy network, supported by a human-robot collaborative annotation interface. Contribution/Results: This is the first work to explicitly embed human causal reasoning into the imitation learning pipeline, shifting the paradigm from “behavioral cloning” to “causal policy learning.” In both simulation and real-robot experiments, our method surpasses state-of-the-art approaches using fewer demonstrations and achieves significantly higher task success rates in unseen scenarios. A user study confirms its improved interpretability and usability.
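The "multimodal alignment" step pairs the human's visual highlights with their natural language prompts. One common way such alignment is realized is a symmetric contrastive (CLIP-style) loss between the two embedding spaces; the sketch below assumes that formulation, and the function name, temperature, and embedding shapes are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def alignment_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning visual-highlight embeddings with
    their paired language-prompt embeddings (a CLIP-style assumption
    about how the multimodal alignment could be implemented).

    visual_emb, text_emb: (N, D) tensors; row i of each is one
    annotated pair (marker crop, language prompt).
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (N, N) pairwise similarities
    labels = torch.arange(len(v))           # matching pairs lie on the diagonal
    # Pull matched pairs together, push mismatched pairs apart, in both directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```

The symmetric form penalizes both a visual feature matching the wrong prompt and a prompt matching the wrong visual feature.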

📝 Abstract
Today's robots learn new tasks by imitating human examples. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding the features that factor into the human's decisions, robot learners often misinterpret the data and fail to perform the task when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to indicate task-relevant features using markers and language prompts. Our proposed algorithm, CIVIL, leverages this augmented data to filter the robot's visual observations and extract a feature representation that causally informs human actions. CIVIL then applies these causal features to train a transformer-based policy that emulates human behaviors without being confused by visual distractors. Our simulations, real-world experiments, and user study demonstrate that robots trained with CIVIL can learn from fewer human demonstrations and perform better than state-of-the-art baselines, especially in previously unseen scenarios. See videos at our project website: https://civil2025.github.io
Problem

Research questions and friction points this paper is trying to address.

Robots misinterpret human actions without understanding decision features
Visual imitation learning lacks insight into human decision-making cues
Current methods fail in new environments due to distractors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses markers and language prompts for human guidance
Filters visual observations with causal features
Trains transformer-based policy to avoid distractors
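The pipeline these bullets describe can be sketched as two stages: zero out image features the human did not mark as causally relevant, then feed the filtered tokens to a transformer policy. The module names, dimensions, and mask interface below are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class CausalFeatureFilter(nn.Module):
    """Keeps only feature regions the human annotated as task-relevant.

    causal_mask is assumed to be a soft saliency map in [0, 1] derived
    from the human's marker annotations.
    """
    def forward(self, features, causal_mask):
        # features: (B, C, H, W); causal_mask: (B, 1, H, W)
        return features * causal_mask  # distractor regions are suppressed

class TransformerPolicy(nn.Module):
    """Minimal transformer policy over filtered visual tokens."""
    def __init__(self, dim=64, n_heads=4, n_layers=2, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, tokens):
        # tokens: (B, T, dim) -- flattened, causally filtered visual features
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))  # pooled tokens -> action
```

Because distractor regions are zeroed before the policy ever sees them, the transformer cannot attend to spurious correlations in the background, which is the intuition behind the improved transfer to unseen scenes.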
Yinlong Dai
Dept. of Mechanical Engineering, Virginia Tech
Robert Ramirez Sanchez
Dept. of Mechanical Engineering, Virginia Tech
Ryan Jeronimus
Dept. of Mechanical Engineering, Virginia Tech
Shahabedin Sagheb
Assistant Collegiate Professor, Virginia Tech
Robot Learning, Machine Learning, Control Theory, Haptics, Game Theory
Cara M. Nunez
Sibley School of Mechanical and Aerospace Engineering, Cornell University
Heramb Nemlekar
California State University, Northridge (CSUN)
robotics, human-robot interaction, imitation learning
Dylan P. Losey
Dept. of Mechanical Engineering, Virginia Tech