OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing World Action Models for robot manipulation: because their representations entangle object identity with scene context, they struggle to ground the object an instruction refers to when the scene shifts. The authors propose OA-WAM, an object-addressable world model that decomposes each frame into one robot slot and N object slots and explicitly separates identity (a persistent address vector) from state (a time-varying content vector). The slots are fused with text, image, proprioception, and past-action tokens in a block-causal sequence. Cross-slot attention is routed through address-only keys, and a flow-matching action head decodes actions in the same forward pass as the world head. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%) and reaches state-of-the-art performance on the most relevant geometric axes of LIBERO-Plus. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines.
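To make the flow-matching action head more concrete, the sketch below shows how such a head is typically sampled: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. This is a minimal illustration, not the authors' implementation. The 16-step chunk length comes from the paper's abstract; `ACTION_DIM`, `toy_velocity`, `decode_action_chunk`, and the number of integration steps are hypothetical, and the toy velocity field stands in for the learned network.

```python
import random

CHUNK_STEPS = 16  # chunk length stated in the abstract
ACTION_DIM = 7    # assumption: e.g. a 6-DoF end-effector delta plus gripper


def toy_velocity(actions, t, cond):
    """Stand-in for the learned velocity field (hypothetical).

    For a straight-line probability path, the velocity at time t points
    from the current sample toward the target, scaled by 1 / (1 - t).
    Here `cond` is simply the target chunk; in a real model it would be
    the transformer features and the field would be a neural network.
    """
    return [[(g - a) / (1.0 - t) for a, g in zip(row_a, row_g)]
            for row_a, row_g in zip(actions, cond)]


def decode_action_chunk(velocity_fn, cond, num_steps=10, seed=0):
    """Euler-integrate the velocity field from noise (t=0) to actions (t=1)."""
    rng = random.Random(seed)
    actions = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
               for _ in range(CHUNK_STEPS)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never hits zero
        velocity = velocity_fn(actions, t, cond)
        actions = [[a + dt * v for a, v in zip(row_a, row_v)]
                   for row_a, row_v in zip(actions, velocity)]
    return actions


target = [[0.5] * ACTION_DIM for _ in range(CHUNK_STEPS)]
chunk = decode_action_chunk(toy_velocity, target)
```

With the toy field, the integration lands on `target` up to floating-point error. A learned field would instead be conditioned on the model's slot and token features.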

📝 Abstract

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.
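The addressability mechanism described in the abstract (address-only keys plus an address reset at every layer) can be sketched as follows. This is a simplified, single-head illustration in plain Python under several assumptions: there are no learned projections, queries are supplied directly in address space, and the function name `address_keyed_layer` is hypothetical. It is not the paper's architecture, only a demonstration that routing weights then depend on addresses alone.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def address_keyed_layer(addresses, contents, queries):
    """One simplified cross-slot attention layer (hypothetical sketch).

    addresses: persistent address vector per slot (robot slot + N objects)
    contents:  time-varying content vector per slot
    queries:   one query per slot, expressed in address space (assumption)
    Returns (addresses, new_contents, attention_weights).
    """
    scale = math.sqrt(len(addresses[0]))
    new_contents, all_weights = [], []
    for query, content in zip(queries, contents):
        # Keys are built from addresses only, so "which slot" never
        # depends on what the slot currently contains.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in addresses]
        weights = softmax(scores)
        # Values carry content; add the mixed content as a residual update.
        mixed = [sum(w * value[j] for w, value in zip(weights, contents))
                 for j in range(len(content))]
        new_contents.append([c + m for c, m in zip(content, mixed)])
        all_weights.append(weights)
    # Address reset: the address slice leaves the layer unchanged.
    return [list(a) for a in addresses], new_contents, all_weights


addresses = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # robot slot + 2 object slots
contents_a = [[1.0], [2.0], [3.0]]
contents_b = [[10.0], [20.0], [30.0]]             # same slots, different state

addr_out, _, weights_a = address_keyed_layer(addresses, contents_a, addresses)
_, _, weights_b = address_keyed_layer(addresses, contents_b, addresses)
```

Here `weights_a` and `weights_b` are identical even though the slot contents differ, and `addr_out` equals the input addresses. That is the sense in which this sketch separates which object to act on from what that object currently is.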

Problem

Research questions and friction points this paper is trying to address.

World Action Models

object addressability

scene shifts

robot manipulation

object identity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-Addressable

Slot-based Representation

World Action Model