🤖 AI Summary
This work addresses visual localization drift in bimanual fine manipulation, which can arise when demonstrations are scarce. Building on the Action Chunking with Transformers (ACT) framework, the authors propose a multistage spatial attention mechanism that uses a pretrained ResNet to extract task-relevant 2D attention points. To suppress attention drift without keypoint annotations, they add a self-supervised temporal alignment loss that aligns predicted future attention sequences with visual features from future frames. Combined with action chunking and the pretrained visual prior, the method is reported to improve localization stability and task success on the ALOHA platform while keeping inference latency low and remaining robust to visual disturbances under the tested conditions.
📝 Abstract
Real-world fine manipulation, particularly bimanual manipulation, typically requires low-latency control and stable visual localization. Collecting large-scale data is costly, however, and limited demonstrations can lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency; generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency; and vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module, built upon ACT with a pretrained ResNet visual prior, that extracts stable, task-relevant 2D attention points as a local spatial modality for action prediction and jointly predicts future attention sequences with a temporal alignment loss. To maintain consistent object tracking, we use a self-supervised objective that aligns predicted attention sequences with visual features from future frames, suppressing drift without keypoint annotations and improving the stability of the vision-to-action mapping under limited data. Experiments on simulated and real-world fine manipulation tasks, conducted on the ALOHA bimanual platform, evaluate task success, attention drift, inference latency, and robustness to visual disturbances. Results indicate improvements in localization stability and task performance while maintaining low-latency inference under the tested conditions.
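The abstract does not say how a 2D attention point is read out from a feature map. One common readout in visuomotor learning is a spatial soft-argmax, and the sketch below assumes that choice purely for illustration. The function name `soft_argmax_2d`, the `temperature` parameter, and the normalized [0, 1] coordinate convention are hypothetical and are not taken from the paper.

```python
import math


def soft_argmax_2d(heatmap, temperature=1.0):
    """Return the expected (x, y) location of a 2D score map.

    Illustrative assumption, not the authors' implementation.
    heatmap: list of rows (H x W) of raw scores for one attention channel.
    Coordinates are normalized so that 0.0 is the first column/row and
    1.0 is the last column/row.
    """
    height, width = len(heatmap), len(heatmap[0])

    # Subtract the maximum score before exponentiating for numerical stability.
    peak = max(max(row) for row in heatmap)
    weights = [
        [math.exp((score - peak) / temperature) for score in row]
        for row in heatmap
    ]
    total = sum(sum(row) for row in weights)

    # Guard against division by zero for maps with a single row or column.
    x_scale = max(width - 1, 1)
    y_scale = max(height - 1, 1)

    # Softmax-weighted average of the pixel coordinates.
    x = sum(w * (j / x_scale) for row in weights for j, w in enumerate(row)) / total
    y = sum(w * (i / y_scale) for i, row in enumerate(weights) for w in row) / total
    return x, y


# A sharp peak in the top-right corner pulls the point toward (1, 0).
scores = [
    [0.0, 0.0, 10.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
point = soft_argmax_2d(scores, temperature=0.1)
```

Because the readout is a weighted average, it is differentiable with respect to the scores, which is one reason such points can be trained with a self-supervised loss rather than keypoint labels. A lower `temperature` concentrates the point on the strongest response, while a higher one averages over a wider region.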