🤖 AI Summary
Visuomotor policies often struggle to model long-horizon dependencies and repeated states because they rely on the Markov assumption, and existing approaches that extend the observation window are not flexible enough to cover diverse memory requirements. To address this, we propose an object-centric history representation based on point tracking that abstracts past observations into compact, structured object-level trajectory sequences. Using lightweight encoding and aggregation modules, the approach supports multiple memory functions in a single representation, including task-stage identification, spatial memorization, and action counting, and it also handles continuous memory updating and pre-loaded memory initialization. It integrates into mainstream visuomotor policies without modifying the downstream policy architecture. Evaluated on multiple embodied manipulation tasks, the method consistently outperforms both Markovian baselines and prior history-based approaches, improving overall task performance and decision accuracy.
📝 Abstract
Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements (such as task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory) and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: http://tonyfang.net/history
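To make the "encode tracked points, then aggregate at the object level" idea concrete, here is a minimal illustrative sketch. It is not the authors' implementation: the function names (`encode_point`, `object_history`), the fixed linear map with `tanh` standing in for a learned point encoder, and mean pooling as the aggregation are all assumptions chosen for brevity.

```python
import math
import random


def encode_point(trajectory, weights):
    """Map one point's (x, y) trajectory to a D-dim vector.

    A fixed linear map followed by tanh stands in for a learned
    encoder (an assumption, not the paper's architecture).
    """
    flat = [c for xy in trajectory for c in xy]  # length 2*T
    return [math.tanh(sum(w * f for w, f in zip(row, flat))) for row in weights]


def object_history(tracks, object_ids, num_objects, weights):
    """Aggregate per-point encodings into one vector per object.

    tracks[n]     -- list of T (x, y) positions of tracked point n
    object_ids[n] -- index of the object that point n belongs to
    Returns a list of num_objects mean-pooled D-dim vectors.
    """
    dim = len(weights)
    sums = [[0.0] * dim for _ in range(num_objects)]
    counts = [0] * num_objects
    for trajectory, obj in zip(tracks, object_ids):
        emb = encode_point(trajectory, weights)
        sums[obj] = [s + e for s, e in zip(sums[obj], emb)]
        counts[obj] += 1
    # Objects with no tracked points keep an all-zero vector.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


# Toy data: 6 tracked points over T=8 past frames, grouped into 3 objects.
random.seed(0)
T, D = 8, 4
object_ids = [0, 0, 0, 1, 1, 2]
tracks = [[(random.random(), random.random()) for _ in range(T)] for _ in object_ids]
weights = [[random.gauss(0, 1) for _ in range(2 * T)] for _ in range(D)]

history = object_history(tracks, object_ids, 3, weights)
print(len(history), len(history[0]))  # 3 4
```

The output size depends only on the number of objects and the embedding width, not on the number of tracked points, which is one way a history representation can stay compact enough to be passed to a policy as an extra input alongside the current observation.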