🤖 AI Summary
This work addresses the challenge of achieving both temporal smoothness and spatial consistency in robot motion under high-frequency control (e.g., 60 Hz), where conventional action chunking falls short. To this end, it moves high-frequency action learning from the action space into the latent space of a variational autoencoder (VAE). It further introduces Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. Together, these changes improve temporal and spatial consistency at high control rates, allowing the robot to complete three real-world, contact-rich tasks in real time with smooth motion and fewer pauses and jerky movements.
📝 Abstract
Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is increased further (e.g., to 60 Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to the latent space of a variational autoencoder (VAE). This formulation significantly improves both the temporal and the spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and jerky motions. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes the tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.
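To make the chunk-boundary problem under asynchronous inference concrete, the following is a minimal sketch and not the Reuse-then-Refine procedure itself, whose details the abstract does not give. It assumes a single action dimension, a new chunk predicted from an observation taken partway through the previous chunk, and a simple linear cross-fade; the function name `stitch_chunks` and the parameters `executed`, `delay`, and `blend` are hypothetical.

```python
def stitch_chunks(prev_chunk, new_chunk, executed, delay, blend):
    """Illustrative stitching of two overlapping action chunks (hypothetical).

    prev_chunk: actions planned earlier (floats, one action dimension).
    new_chunk:  actions predicted from an observation taken at step
                `executed` of prev_chunk, so new_chunk[i] targets the same
                time step as prev_chunk[executed + i].
    delay:      control steps that elapse while inference is running; the
                old actions are kept for these steps.
    blend:      number of steps over which to cross-fade from old to new.
    """
    stitched = []
    for i, new_action in enumerate(new_chunk):
        j = executed + i
        if j >= len(prev_chunk):
            # No overlap with the previous chunk: take the new action as is.
            stitched.append(new_action)
            continue
        if i < delay:
            w = 0.0  # keep the old action during the inference delay
        elif i < delay + blend:
            w = (i - delay + 1) / (blend + 1)  # linear cross-fade weight
        else:
            w = 1.0  # fully trust the new chunk
        stitched.append((1.0 - w) * prev_chunk[j] + w * new_action)
    return stitched


def max_jump(actions):
    """Largest absolute change between consecutive actions."""
    return max(abs(b - a) for a, b in zip(actions, actions[1:]))


# Toy example: the old plan holds 0.0, the new plan asks for 1.0.
prev_chunk = [0.0] * 10
new_chunk = [1.0] * 10
stitched = stitch_chunks(prev_chunk, new_chunk, executed=4, delay=2, blend=3)
```

In this toy case, switching directly to the new chunk would produce a step of 1.0 between consecutive actions, whereas `max_jump(stitched)` is 0.25. A real controller running at 60 Hz would also skip the first `delay` entries, since those steps have already been executed by the time the new chunk arrives.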