STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing robotic policies do not adequately model action-centric spatiotemporal interaction structure, which limits their ability to reason about the future in complex manipulation tasks. This work proposes STARRY, a world-model-enhanced policy that jointly denoises future spatiotemporal latents and action sequences. To align spatiotemporal prediction with action generation, STARRY introduces Geometry-Aware Selective Attention Modulation, which converts predicted depth and end-effector geometry into token-aligned weights that modulate action attention. On RoboTwin 2.0, the method achieves average success rates of 93.82% and 93.30% under the Clean and Randomized settings, respectively. In real-world robot experiments, it raises average success from 42.5% to 70.8% over $π_{0.5}$.

📝 Abstract

Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation to convert predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under the Clean and Randomized settings, respectively. In real-world experiments, STARRY further improves average success from 42.5% to 70.8% over $π_{0.5}$, demonstrating the effectiveness of action-centric spatial-temporal world modeling for robotic action generation with demanding spatial-temporal requirements.
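The abstract does not spell out how geometry is turned into attention weights, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes each visual token has a 3D point back-projected from predicted depth, that the weight is a Gaussian falloff in distance to the end-effector, and that the weights enter attention as an additive log-bias on the logits. The names `geometry_weights`, `modulated_attention`, `token_xyz`, `ee_xyz`, and `sigma` are all hypothetical.

```python
import numpy as np

def geometry_weights(token_xyz, ee_xyz, sigma=0.1):
    """Token-aligned weights in (0, 1] (assumed form, not from the paper).

    Tokens whose 3D point lies near the end-effector get a weight near 1;
    distant tokens decay toward 0 with a Gaussian falloff.
    """
    d2 = np.sum((token_xyz - ee_xyz) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def modulated_attention(q, k, v, weights, eps=1e-6):
    """Scaled dot-product attention with log-weights added to the logits.

    Adding log(w) before the softmax is equivalent to multiplying the
    unnormalized attention by w and renormalizing each row.
    """
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + np.log(weights + eps)          # broadcast over queries
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

# Toy inputs: 4 action queries attending over 16 visual tokens of width 8.
rng = np.random.default_rng(0)
token_xyz = rng.uniform(-0.5, 0.5, size=(16, 3))  # hypothetical 3D token points
ee_xyz = token_xyz[3]                             # end-effector placed at token 3
q = rng.standard_normal((4, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))

w = geometry_weights(token_xyz, ee_xyz)
out, attn = modulated_attention(q, k, v, w)
```

In this toy setup the token that coincides with the end-effector receives the largest weight, so action queries are biased toward the region the gripper is about to interact with. A learned mapping from depth and end-effector geometry to weights could replace the fixed Gaussian without changing the attention code.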

Problem

Research questions and friction points this paper is trying to address.

spatial-temporal modeling

robotic manipulation

action-centric reasoning

world modeling

VLA policies

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial-temporal world modeling

action-centric reasoning

geometry-aware attention