CIVIL: Causal and Intuitive Visual Imitation Learning

📅 2025-04-24

📈 Citations: 0

✨ Influential: 0

career value

178K/year

🤖 AI Summary

Existing visual imitation learning methods merely replicate human demonstrations without modeling the underlying causal decision-making mechanisms, resulting in poor generalization and failure to transfer across environments. Method: We propose a causality-driven imitation learning framework that explicitly incorporates human causal intuition—via intuitive annotations of task-critical causal elements (visual highlights + natural language prompts)—to guide robots in identifying causal visual features. Our approach integrates causal feature filtering, multimodal alignment, and a Transformer-based policy network, supported by a human-robot collaborative annotation interface. Contribution/Results: This is the first work to explicitly embed human causal reasoning into the imitation learning pipeline, shifting the paradigm from “behavioral cloning” to “causal policy learning.” In both simulation and real-robot experiments, our method surpasses state-of-the-art approaches using fewer demonstrations and achieves significantly higher task success rates in unseen scenarios. A user study confirms its improved interpretability and usability.

Technology Category

Application Category

📝 Abstract

Today's robots learn new tasks by imitating human examples. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding the features that factor into the human's decisions, robot learners often misinterpret the data and fail to perform the task when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to indicate task-relevant features using markers and language prompts. Our proposed algorithm, CIVIL, leverages this augmented data to filter the robot's visual observations and extract a feature representation that causally informs human actions. CIVIL then applies these causal features to train a transformer-based policy that emulates human behaviors without being confused by visual distractors. Our simulations, real-world experiments, and user study demonstrate that robots trained with CIVIL can learn from fewer human demonstrations and perform better than state-of-the-art baselines, especially in previously unseen scenarios. See videos at our project website: https://civil2025.github.io

Problem

Research questions and friction points this paper is trying to address.

Robots misinterpret human actions without understanding decision features

Visual imitation learning lacks insight into human decision-making cues

Current methods fail in new environments due to distractors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses markers and language prompts for human guidance

Filters visual observations with causal features

Trains transformer-based policy to avoid distractors

🔎 Similar Papers

No similar papers found.