🤖 AI Summary
Causal confusion, where models mistake spurious correlations for causal relationships, is prevalent in imitation learning and degrades generalization under distribution shift. To address this, the authors propose GABRIL, a method that leverages human expert gaze data, collected during the demonstration phase, as a guidance signal for representation learning. GABRIL adds a gaze-based regularization loss that encourages the model to attend to causally relevant regions identified by expert gaze, mitigating the influence of confounding variables while also improving interpretability. Evaluated on Atari environments and the Bench2Drive benchmark in CARLA, GABRIL's improvement over behavior cloning exceeds that of existing baselines by around 179% and 76%, respectively. The core contribution is the use of eye-tracking data to counter causal confusion in imitation learning, jointly improving explainability and out-of-distribution generalization.
📝 Abstract
Imitation Learning (IL) is a widely adopted approach that enables agents to learn from human expert demonstrations by framing the task as a supervised learning problem. However, IL often suffers from causal confusion, where agents misinterpret spurious correlations as causal relationships, leading to poor performance in test environments with distribution shift. To address this issue, we introduce GAze-Based Regularization in Imitation Learning (GABRIL), a novel method that leverages human gaze data gathered during the data collection phase to guide representation learning in IL. GABRIL applies a regularization loss that encourages the model to focus on causally relevant features identified through expert gaze, thereby mitigating the effects of confounding variables. We validate our approach in Atari environments and the Bench2Drive benchmark in CARLA by collecting human gaze datasets and applying our method in both domains. Experimental results show that GABRIL's improvement over behavior cloning exceeds that of the other baselines by around 179% in Atari and 76% in the CARLA setup. Finally, we show that our method provides extra explainability compared to regular IL agents.
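The abstract describes a regularization loss that pushes the model's learned features toward gaze-identified regions, added on top of a standard behavior-cloning objective. As a minimal sketch only, here is one plausible form of such a loss: a KL divergence between the model's spatial attention map and the expert gaze heatmap. The specific divergence, the weighting coefficient, and the function names below are assumptions for illustration, not the paper's exact formulation.

```python
import math

def gaze_regularization_loss(attention_map, gaze_map, eps=1e-8):
    """KL divergence KL(gaze || attention) between the expert gaze heatmap
    and the model's spatial attention map (both flattened, normalized to
    sum to 1). Hypothetical loss form; the paper's exact formulation may differ."""
    a_sum = sum(attention_map) + eps
    g_sum = sum(gaze_map) + eps
    a = [x / a_sum for x in attention_map]       # normalized attention
    g = [x / g_sum for x in gaze_map]            # normalized gaze heatmap
    return sum(gi * math.log((gi + eps) / (ai + eps)) for gi, ai in zip(g, a))

def total_loss(bc_loss, attention_map, gaze_map, lam=0.1):
    """Total objective: behavior-cloning loss plus a weighted gaze regularizer.
    `lam` is an assumed trade-off hyperparameter."""
    return bc_loss + lam * gaze_regularization_loss(attention_map, gaze_map)
```

When the attention map matches the gaze heatmap the regularizer is near zero, so the objective reduces to plain behavior cloning; attention mass placed on regions the expert never looked at (potential confounders) is penalized.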