🤖 AI Summary
This work addresses a limitation of existing World Action Models (WAMs): they execute a fixed number of predicted actions after each inference and never check whether the predicted future still matches real-world observations, which hurts responsiveness in contact-rich or difficult phases. The authors formulate adaptive execution as a future-reality verification problem and introduce Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. Action chunk length then adapts to prediction-observation consistency: the robot executes longer while the prediction holds and replans earlier when reality deviates. The authors also introduce Mixture-of-Horizon Training to improve coverage of long-horizon trajectories. On RoboTwin, the method reduces WAM forward passes by 69.10% and execution time by 34.02% while improving success rate by 2.54% over the short-chunk baseline; in real-robot trials, it improves success rate by 35%.
📝 Abstract
World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
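The execute-longer-or-replan-earlier loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `execute_adaptively`, `plan`, `verify`, and `ToyEnv` are hypothetical, the state is a single scalar, and the verifier is a simple distance threshold standing in for FFDC, which the abstract describes as a learned verifier over actions, visual dynamics, observations, and language. This is not the authors' implementation.

```python
from typing import Callable, List, Tuple


def execute_adaptively(
    plan: Callable[[float], Tuple[List[float], List[float]]],
    step: Callable[[float, float], float],
    verify: Callable[[float, float], float],
    obs: float,
    total_steps: int,
    threshold: float = 0.5,
) -> Tuple[float, int]:
    """Execute actions until total_steps is reached, replanning early
    whenever the verifier's trust score drops below the threshold.

    Returns the final observation and the number of planner calls.
    """
    forward_passes = 0
    executed = 0
    while executed < total_steps:
        # One (hypothetical) model inference: actions plus predicted observations.
        actions, predicted_obs = plan(obs)
        forward_passes += 1
        for action, predicted in zip(actions, predicted_obs):
            obs = step(obs, action)
            executed += 1
            if executed >= total_steps:
                break
            # Stand-in verifier: stop the chunk early if reality has drifted.
            if verify(predicted, obs) < threshold:
                break
    return obs, forward_passes


class ToyEnv:
    """1-D toy environment: adds a 0.5 disturbance at chosen time steps."""

    def __init__(self, disturb_at=()):
        self.t = 0
        self.disturb_at = set(disturb_at)

    def step(self, obs: float, action: float) -> float:
        self.t += 1
        bump = 0.5 if self.t in self.disturb_at else 0.0
        return obs + action + bump


def plan(obs: float, horizon: int = 8) -> Tuple[List[float], List[float]]:
    # Toy planner: always moves +1 per step and predicts the resulting states.
    actions = [1.0] * horizon
    predicted = [obs + k for k in range(1, horizon + 1)]
    return actions, predicted


def verify(predicted: float, actual: float) -> float:
    # Toy trust score: 1.0 if prediction and observation agree, else 0.0.
    return 1.0 if abs(predicted - actual) < 0.1 else 0.0


clean_obs, clean_passes = execute_adaptively(plan, ToyEnv().step, verify, 0.0, 16)
noisy_obs, noisy_passes = execute_adaptively(plan, ToyEnv([3]).step, verify, 0.0, 16)
print(clean_passes, noisy_passes)
```

In this toy setup, the undisturbed run finishes 16 steps with two planner calls, while a single disturbance at step 3 triggers one extra replan. A fixed short chunk of, say, two actions would need eight planner calls regardless of whether anything went wrong, which is the kind of trade-off the abstract refers to.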