🤖 AI Summary
Existing vision-language-action (VLA) models suffer significant performance degradation in dynamic environments due to insufficient temporal awareness and a lack of relevant training data. To address this, the work proposes PUMA, a dynamics-aware VLA architecture, along with DOMINO, a large-scale dynamic manipulation dataset, and the first hierarchical task benchmark for general-purpose dynamic manipulation. PUMA implicitly predicts short-term future states of target objects by integrating scene-centric historical optical flow encoding, object-centric world querying, and expert trajectory imitation learning, thereby enhancing spatiotemporal generalization. Experiments demonstrate that PUMA achieves state-of-the-art performance on dynamic tasks, improving success rates by 6.3% (absolute) over baseline methods, and that its dynamics-trained representations transfer effectively to static manipulation tasks.
📝 Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
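To make the described mechanism concrete, the coupling of history-aware perception with short-horizon prediction might be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: every module name, layer size, and the exact fusion scheme (cross-attending learnable "world queries" over visual tokens concatenated with encoded optical-flow history) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PumaSketch(nn.Module):
    """Hypothetical sketch of the PUMA idea: fuse current visual tokens with
    an encoding of historical optical flow, then let learnable object-centric
    "world queries" implicitly forecast short-term future object states that
    condition the action head. All dimensions and names are assumptions."""

    def __init__(self, feat_dim=128, n_queries=4, action_dim=7):
        super().__init__()
        # Encode a stack of historical optical-flow maps, shape (B, T, 2, H, W)
        self.flow_encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim),
        )
        # Learnable object-centric world queries
        self.world_queries = nn.Parameter(torch.randn(n_queries, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                                batch_first=True)
        # Implicit future-state head (auxiliary short-horizon forecast)
        self.future_head = nn.Linear(feat_dim, feat_dim)
        # Action head conditioned on the pooled query outputs
        self.action_head = nn.Linear(feat_dim, action_dim)

    def forward(self, obs_feat, flow_history):
        # obs_feat: (B, N, feat_dim) visual tokens from the VLA backbone
        # flow_history: (B, T, 2, H, W) past optical-flow frames
        b, t = flow_history.shape[:2]
        flow_tokens = self.flow_encoder(flow_history.flatten(0, 1)).view(b, t, -1)
        # Scene-centric context = current tokens + historical flow tokens
        context = torch.cat([obs_feat, flow_tokens], dim=1)
        queries = self.world_queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(queries, context, context)
        future_states = self.future_head(attended)            # implicit forecast
        action = self.action_head(future_states.mean(dim=1))  # (B, action_dim)
        return action, future_states
```

In a sketch like this, the future-state tokens would carry an auxiliary prediction loss during imitation learning, so that the policy's action head is conditioned on where the target object is heading rather than only on the current frame.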