🤖 AI Summary
Existing vision-language-action (VLA) models struggle with spatial reasoning when target objects lie outside the field of view, leading to ineffective manipulation. To address this limitation, this work proposes SOMA, a novel framework that introduces, for the first time in VLA models, a persistent spatial memory mechanism grounded in multi-view scanning. SOMA employs a movable camera to construct and maintain cross-view spatial-semantic representations through three core modules: spatial memory construction, dynamic refinement, and context-aware retrieval. By integrating multi-view observations with semantic information, the framework enables instruction-driven access to spatial memory. Evaluated on five real-world tasks, SOMA significantly improves task success rates, facilitates rapid target localization, reduces redundant viewpoint exploration, and enables near-single-attempt grasping under partially observable conditions.
📝 Abstract
We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which scans the scene and aggregates observations taken at different viewing angles into a unified spatial-semantic representation; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five challenging real-world out-of-vision manipulation tasks, including multi-step and dual-arm scenarios where target objects are initially invisible. Experimental results show that SOMA not only improves task success rates but also induces qualitatively different manipulation behaviors, with faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability. Additional experiments on RoboCasa GR1 and SimplerEnv further validate the effectiveness of SOMA's memory design under conventional fully observable settings. Code will be released soon.
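To make the three-stage flow concrete, here is a minimal toy sketch in Python. It is not SOMA's implementation, which has not been released: the class and method names (`SpatialMemory`, `construct`, `refine`, `retrieve`), the use of string labels and pan angles in place of learned spatial-semantic features, and the keyword-matching retrieval rule are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    label: str      # hypothetical semantic label for an observed object
    pan_deg: float  # head-camera pan angle at which it was last seen
    step: int       # time step of the most recent observation


@dataclass
class SpatialMemory:
    """Toy stand-in for a persistent spatial memory (illustrative only)."""

    entries: dict[str, MemoryEntry] = field(default_factory=dict)

    def construct(self, scan, step: int = 0) -> None:
        """Build memory from one sweep: iterable of (pan_deg, labels)."""
        for pan_deg, labels in scan:
            for label in labels:
                self.entries[label] = MemoryEntry(label, pan_deg, step)

    def refine(self, pan_deg: float, labels, step: int, tol_deg: float = 1.0) -> None:
        """Update memory from a new view at pan_deg.

        Entries recorded at (roughly) this angle but absent from the new
        view are dropped; everything seen now is inserted or refreshed.
        """
        seen = set(labels)
        for label, entry in list(self.entries.items()):
            if abs(entry.pan_deg - pan_deg) <= tol_deg and label not in seen:
                del self.entries[label]
        for label in seen:
            self.entries[label] = MemoryEntry(label, pan_deg, step)

    def retrieve(self, instruction: str) -> list[MemoryEntry]:
        """Return entries whose label appears as a word in the instruction."""
        words = set(instruction.lower().replace(".", " ").split())
        return [e for e in self.entries.values() if e.label in words]


# Usage: scan three viewing angles, then query for an object that is
# outside the forward-facing (0 degree) view.
memory = SpatialMemory()
memory.construct([(-60.0, ["mug"]), (0.0, ["plate"]), (60.0, ["sponge"])])
for hit in memory.retrieve("Pick up the sponge"):
    print(f"{hit.label} was last seen at pan angle {hit.pan_deg}")
```

The point of the sketch is only the control flow: a scan populates memory once, later views overwrite stale entries, and the instruction selects which stored locations matter, so the robot can turn toward a remembered target instead of searching from scratch.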