🤖 AI Summary
This work addresses the limitation of existing vision-language-action (VLA) models in handling long-horizon, non-Markovian tasks that require historical context, as these models typically lack explicit memory mechanisms. To overcome this, the paper introduces a language-based scratchpad into the VLA architecture, enabling the model to explicitly record and update task-relevant information, such as object locations and subgoal states, thereby endowing it with spatiotemporal memory and plan-tracking capabilities. The proposed approach is compatible with both recurrent and non-recurrent architectures and demonstrates significant improvements in generalization across ClevrSkills, MemoryBench, and real-world memory-dependent manipulation tasks. These results highlight the critical role of the language scratchpad in long-horizon task understanding and execution, overcoming a key limitation of conventional stateless VLA models.
📝 Abstract
Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.
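To make the scratchpad idea concrete, here is a minimal toy sketch of the control loop it implies: at each step the policy reads its observation, writes task-relevant notes (object positions, subgoal progress) into a persistent text-like store, and conditions its next action on that store. The class name, observation keys, and update rules below are hypothetical illustrations, not the paper's actual architecture, in which the scratchpad is generated and consumed by the VLA model itself.

```python
from dataclasses import dataclass, field


@dataclass
class ScratchpadVLA:
    """Toy illustration (not the paper's implementation) of a policy that
    carries a language scratchpad across steps. The scratchpad stores
    task-relevant notes so an otherwise stateless policy can act on
    information it can no longer observe directly."""
    scratchpad: dict = field(default_factory=dict)

    def step(self, observation: dict) -> str:
        # Record positions of currently visible objects, so they remain
        # available after the objects are occluded or moved out of view.
        for name, pos in observation.get("visible_objects", {}).items():
            self.scratchpad[f"pos:{name}"] = pos
        # Track plan progress: count subgoals marked complete this step.
        if observation.get("subgoal_done"):
            self.scratchpad["subgoals_done"] = (
                self.scratchpad.get("subgoals_done", 0) + 1
            )
        # A real VLA would condition action generation on both the image
        # and the serialized scratchpad text; here we return a placeholder.
        return f"act with memory of {len(self.scratchpad)} entries"
```

In this sketch the scratchpad is a dictionary for clarity; the paper's point is that expressing the same state in natural language lets the VLA's existing language backbone read and update it without architectural changes.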