🤖 AI Summary
This paper proposes EAGLE to address three challenges in embodied intelligence: poor cross-environment generalization of visuomotor policies, the high cost of acquiring large-scale labeled data, and the structural degradation of task-relevant features caused by global data augmentation. Methodologically, EAGLE introduces (1) control-aware, mask-guided local augmentation, enabled by self-supervised identification of the image regions critical for motor control, and (2) a lightweight visuomotor policy knowledge-distillation mechanism that enables zero-shot transfer without fine-tuning. Evaluated on the DMControl generalization benchmark, an enhanced robot-manipulation perturbation benchmark, and a long-horizon drawer-opening task, EAGLE improves average generalization performance by 23% over state-of-the-art methods and accelerates training convergence by 1.8×.
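The distillation mechanism in (2) can be illustrated with a toy sketch. This is not the paper's architecture: it assumes linear policies and a mean-squared action-matching loss, with the expert frozen as the teacher, purely to show the expert-to-student transfer pattern.

```python
import numpy as np

def distill_step(student_w, expert_w, obs_batch, lr=0.05):
    """One gradient step matching the student's actions to the frozen expert's.

    Hypothetical linear policies: action = obs @ W. Loss is
    0.5 * mean ||student_action - expert_action||^2 over the batch.
    """
    expert_actions = obs_batch @ expert_w    # teacher targets (no gradient)
    student_actions = obs_batch @ student_w
    err = student_actions - expert_actions
    grad = obs_batch.T @ err / len(obs_batch)  # dL/dW for the MSE loss above
    return student_w - lr * grad

rng = np.random.default_rng(1)
expert_w = rng.normal(size=(8, 2))   # pretrained "expert" policy (frozen)
student_w = np.zeros((8, 2))         # student starts from scratch
obs = rng.normal(size=(64, 8))       # batch of observations
for _ in range(200):
    student_w = distill_step(student_w, expert_w, obs)
# the student's actions converge toward the expert's on this batch
```

In EAGLE the student is then deployed to unseen environments without further fine-tuning; the sketch only captures the supervised action-matching step, not the augmentation pipeline feeding it.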
📝 Abstract
Improving generalization is a key challenge in embodied AI, where obtaining large-scale datasets across diverse scenarios is costly. Traditional weak augmentations, such as cropping and flipping, are insufficient for improving a model's performance in new environments, while stronger existing augmentation methods often disrupt task-relevant information in images, potentially degrading performance. To overcome these challenges, we introduce EAGLE, an efficient training framework for generalizable visuomotor policies that improves upon existing methods by (1) enhancing generalization by applying augmentation only to control-related regions identified through a self-supervised control-aware mask, and (2) improving training stability and efficiency by distilling knowledge from an expert to a visuomotor student policy, which is then deployed to unseen environments without further fine-tuning. Comprehensive experiments on three domains, namely the DMControl Generalization Benchmark, the enhanced Robot Manipulation Distraction Benchmark, and a long-horizon drawer-opening task, validate the effectiveness of our method.
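The mask-guided local augmentation in (1) can be sketched as follows. This is a minimal NumPy illustration, assuming a precomputed binary mask and a random-convolution augmentation as stand-ins; it does not reproduce the paper's self-supervised mask learning, only the idea of restricting a strong augmentation to masked regions while leaving the rest of the observation untouched.

```python
import numpy as np

def masked_augment(obs, mask, augment_fn, rng):
    """Blend: augment only where mask == 1, keep original pixels elsewhere.

    obs: (H, W, 3) float image in [0, 1]; mask: (H, W) binary array.
    `augment_fn` is a hypothetical stand-in for any strong augmentation.
    """
    augmented = augment_fn(obs, rng)
    m = mask[..., None]                      # broadcast mask over channels
    return m * augmented + (1.0 - m) * obs

def random_conv(obs, rng):
    """Strong augmentation example: random 3x3 mixing of color channels."""
    w = rng.normal(size=(3, 3)) / 3.0
    return np.clip(obs @ w, 0.0, 1.0)

rng = np.random.default_rng(0)
obs = rng.random((84, 84, 3))
mask = np.zeros((84, 84))
mask[20:60, 20:60] = 1.0                     # placeholder control-aware mask
out = masked_augment(obs, mask, random_conv, rng)
```

In EAGLE the mask itself is produced by the self-supervised control-aware module rather than fixed by hand, but the blending step above shows why structure outside the masked region cannot be degraded by the augmentation.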