🤖 AI Summary
To address two key bottlenecks in robotic cyclic manipulation tasks (e.g., shaking a bottle, hammering a nail), namely inadequate modeling of historical information and the absence of a standardized evaluation benchmark, this paper introduces CycleManip, an end-to-end imitation learning framework, together with a dedicated cyclic manipulation benchmark. Methodologically, the framework strengthens history modeling through cost-aware sampling and uses multi-task learning to jointly optimize action prediction and historical state understanding, without auxiliary models or hierarchical architectures. The contributions are: (1) the first open-source, automatically evaluable cyclic manipulation benchmark; and (2) a lightweight, efficient, plug-and-play framework that works with imitation policies such as Vision-Language-Action (VLA) models and across robot platforms. Experiments in simulation and on real robots report +23.6% task completion accuracy, 17.4% earlier termination (better timeliness), and stronger generalization.
📝 Abstract
In this paper, we explore an important yet underexplored task in robot manipulation: cycle-based manipulation, where robots must perform cyclic or repetitive actions and stop at an expected terminal time. Such tasks are common in daily life, for example shaking a bottle or hammering a nail. However, few prior works have studied this task, which leaves two main challenges: 1) existing imitation methods often fail to complete these tasks within the expected terminal time because they use history ineffectively; 2) the absence of a benchmark with sufficient data and automatic evaluation tools hinders the development of effective solutions in this area. To address these challenges, we first propose the CycleManip framework, which achieves cycle-based manipulation in an end-to-end imitation manner without requiring any extra models, hierarchical structure, or significant computational overhead. The core insight is to enhance history perception with a cost-aware sampling strategy and to improve historical understanding with multi-task learning. Second, we introduce a cycle-based manipulation benchmark, which provides diverse cycle-based tasks and an automatic evaluation method. Extensive experiments in both simulation and real-world settings demonstrate that our method achieves high success rates in cycle-based manipulation. The results further show that it adapts well to general manipulation tasks and can be used in a plug-and-play manner with imitation policies such as Vision-Language-Action (VLA) models. Moreover, the results show that our approach can be applied across diverse robotic platforms, including bi-arm grippers, dexterous hands, and humanoid robots.