AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

📅 2026-04-13
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing unified world models struggle to reliably decode robot actions from pretrained video generation models because they do not explicitly model interaction locations and manipulation intent. This work proposes AIM, which introduces spatial value maps as an explicit interface within a unified framework. AIM employs a mixture-of-transformers architecture to jointly model future visual observations and task-relevant interaction structure, and incorporates an intent-causal attention mechanism that routes future information to the action branch only through the value maps. A self-distillation reinforcement learning stage then optimizes only the action head while the video and value branches stay frozen. Evaluated on the RoboTwin 2.0 benchmark, AIM achieves a 94.0% average success rate, outperforming prior unified world action baselines, with larger gains on long-horizon and contact-sensitive tasks.
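The routing constraint described above can be pictured as a block-wise attention mask. The sketch below is only an illustration under stated assumptions: the three token groups (`video`, `value`, `action`), the `allowed` table, and the function name `intent_causal_mask` are hypothetical, and the paper's actual mask (for example, how current-frame observations or temporal causality are handled) may differ. It shows one way to ensure that action tokens never attend to future video tokens directly, only to value-map tokens.

```python
from typing import Dict, List, Set

# Hypothetical visibility table (an assumption, not the paper's exact design):
# each query group lists the key groups it is allowed to attend to.
ALLOWED: Dict[str, Set[str]] = {
    "video": {"video"},             # video tokens see only video tokens
    "value": {"video", "value"},    # value tokens read the predicted future
    "action": {"value", "action"},  # action tokens reach the future only via value tokens
}


def intent_causal_mask(groups: List[str]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k."""
    return [[key in ALLOWED[query] for key in groups] for query in groups]


# Toy sequence: two video tokens, two value tokens, one action token.
groups = ["video", "video", "value", "value", "action"]
mask = intent_causal_mask(groups)

for group, row in zip(groups, mask):
    print(f"{group:>6}: {row}")
```

In this toy layout the action row is blocked on both video columns, so any information about the predicted future has to pass through the value tokens first.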

📝 Abstract
Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on the RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.
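The reward used in the self-distillation stage combines a dense term read from the value map with a sparse task-level term. The abstract does not give the formula, so the following is a minimal sketch under assumptions: the function name `shaped_reward`, the weights `dense_weight` and `sparse_weight`, the clamping of out-of-image pixels, and the idea that the dense term is the value-map response at the pixel where the action projects are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def shaped_reward(
    value_map: List[List[float]],
    pixel: Tuple[int, int],
    task_success: bool,
    dense_weight: float = 1.0,    # assumed weight, not from the paper
    sparse_weight: float = 10.0,  # assumed weight, not from the paper
) -> float:
    """Combine a dense value-map response with a sparse success signal."""
    height, width = len(value_map), len(value_map[0])
    row, col = pixel
    # Clamp the projected pixel to the map bounds (an assumption of this sketch).
    row = min(max(row, 0), height - 1)
    col = min(max(col, 0), width - 1)
    dense = value_map[row][col]
    sparse = 1.0 if task_success else 0.0
    return dense_weight * dense + sparse_weight * sparse


# Toy 2x2 value map with its peak at the bottom-right pixel.
value_map = [
    [0.0, 0.1],
    [0.2, 0.9],
]
reward_in_progress = shaped_reward(value_map, (1, 1), task_success=False)
reward_on_success = shaped_reward(value_map, (1, 1), task_success=True)
reward_clamped = shaped_reward(value_map, (5, -3), task_success=False)
print(reward_in_progress, reward_on_success, reward_clamped)
```

The dense term gives the action head a learning signal at every step, while the sparse term only fires when the task completes; the relative weighting here is arbitrary and would need tuning in practice.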
Problem

Research questions and friction points this paper is trying to address.

world action modeling
spatial reasoning
manipulation intent
video generation
robot control
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial value maps
intent-aware modeling
unified world action model
video-conditioned control
self-distillation reinforcement learning
Liaoyuan Fan
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China
Zetian Xu
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China
Chen Cao
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China
Wenyao Zhang
PhD Student, Shanghai Jiao Tong University
Robot Learning, Representation Learning
Mingqi Yuan
PhD candidate at HKPU
Machine Learning
Jiayu Chen
INFIFORCE Intelligent Technology Co., Ltd., Hangzhou, China; The University of Hong Kong, Hong Kong SAR, China