🤖 AI Summary
This work addresses the limited generalization of vision-based imitation learning in agriculture, which stems from scarce demonstration data and substantial visual domain gaps arising from crop-specific appearance diversity and background variation. To overcome this, the authors propose DRAIL, a framework built around a dual-region augmentation mechanism that distinguishes task-relevant from task-irrelevant regions. Task-relevant regions receive domain-knowledge-driven, feature-preserving augmentations, while task-irrelevant regions undergo aggressive randomization to suppress spurious background cues. This strategy decouples task-essential features from environmental distractions and integrates with a diffusion-based policy for robust visuomotor control. Evaluated in robot experiments on artificial-vegetable harvesting and real-world lettuce defective-leaf picking preparation tasks, DRAIL consistently improves success rates under unseen visual conditions, concentrates attention on task-essential visual features, and demonstrates stronger generalization and robustness than baseline methods.
📝 Abstract
Vision-based imitation learning has shown promise for robotic manipulation; however, its generalization remains limited in practical agricultural tasks. This limitation stems from scarce demonstration data and substantial visual domain gaps caused by (i) crop-specific appearance diversity and (ii) background variation. To address it, we propose Dual-Region Augmentation for Imitation Learning (DRAIL), a region-aware augmentation framework designed for generalizable vision-based imitation learning in agricultural manipulation. DRAIL explicitly separates visual observations into task-relevant and task-irrelevant regions. The task-relevant region is augmented in a domain-knowledge-driven manner that preserves essential visual characteristics, while the task-irrelevant region is aggressively randomized to suppress spurious background correlations. By jointly handling both sources of visual variation, DRAIL promotes policies that rely on task-essential features rather than incidental visual cues. We evaluate DRAIL on diffusion-policy-based visuomotor controllers through robot experiments on an artificial-vegetable harvesting task and a real-world lettuce defective-leaf picking preparation task. The results show consistent improvements in success rates under unseen visual conditions compared to baseline methods. Further attention analysis and representation-generalization metrics indicate that the learned policies rely more on task-essential visual features, yielding enhanced robustness and generalization.
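To make the dual-region idea concrete, here is a minimal sketch of how such an augmentation could be applied per observation frame, assuming an RGB image and a binary task-relevance mask are available. The function names and the specific transforms (mild brightness/color jitter inside the task-relevant region, noise replacement outside it) are illustrative assumptions, not the paper's exact augmentations or segmentation method.

```python
# Illustrative sketch of a dual-region augmentation in the spirit of DRAIL.
# All names and transform choices are assumptions; the abstract does not
# specify the exact augmentations or how the task-relevant mask is obtained.
import numpy as np

rng = np.random.default_rng(0)

def augment_task_relevant(region: np.ndarray) -> np.ndarray:
    """Mild, feature-preserving augmentation (assumed: small brightness
    and per-channel color jitter that keeps crop appearance intact)."""
    gain = rng.uniform(0.9, 1.1)             # global brightness factor
    color = rng.uniform(0.95, 1.05, size=3)  # gentle per-channel shift
    out = region.astype(np.float32) * gain * color
    return np.clip(out, 0, 255).astype(np.uint8)

def randomize_task_irrelevant(region: np.ndarray) -> np.ndarray:
    """Aggressive randomization (assumed: replace the background with
    random noise to break spurious background correlations)."""
    return rng.integers(0, 256, size=region.shape, dtype=np.uint8)

def dual_region_augment(obs: np.ndarray, task_mask: np.ndarray) -> np.ndarray:
    """obs: HxWx3 uint8 image; task_mask: HxW bool, True = task-relevant."""
    mask3 = task_mask[..., None]  # broadcast mask over the color channels
    return np.where(mask3,
                    augment_task_relevant(obs),
                    randomize_task_irrelevant(obs))

# Usage: augment each demonstration frame before policy training.
obs = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)
mask = np.zeros((96, 96), dtype=bool)
mask[30:70, 30:70] = True  # hypothetical task-relevant region
aug = dual_region_augment(obs, mask)
```

In this reading, the two branches are what decouple the sources of variation: the policy sees consistent crop features wherever the mask is true, while everything outside the mask changes on every sample and so carries no learnable signal.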