🤖 AI Summary
This work addresses efficient self-localization for robots operating in large-scale environments using historical visual observations. We propose Kinaema, a model that maintains a recurrently updated implicit latent memory, avoiding explicit storage of past frames and the context-length limits inherent to standard attention mechanisms. By combining transformer-based sequence modeling with implicit memory compression, Kinaema learns spatial relationships directly from a continuous visual stream and outputs the six-degree-of-freedom pose of a query image relative to the robot's current pose. On the newly introduced Mem-Nav navigation benchmark, Kinaema improves long-range localization accuracy and inference efficiency over conventional attention-based baselines while keeping its memory footprint and computational overhead low.
📝 Abstract
One key aspect of spatially aware robots is the ability to "find their bearings", i.e. to correctly situate themselves in previously seen spaces. In this work, we focus on this particular scenario of continuous robotics operations, where information observed before an actual episode start is exploited to optimize efficiency. We introduce a new model, Kinaema, and an agent capable of integrating a stream of visual observations while moving in a potentially large scene and, upon request, processing a query image and predicting the relative position of the shown space with respect to its current position. Our model does not explicitly store an observation history and therefore has no hard constraint on context length. It maintains an implicit latent memory, updated by a transformer in a recurrent way, which compresses the history of sensor readings into a compact representation. We evaluate the impact of this model in a new downstream task we call "Mem-Nav". We show that our large-capacity recurrent model maintains a useful representation of the scene, navigates to goals observed before the actual episode start, and is computationally efficient, in particular compared to classical transformers with attention over an observation history.
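The recurrent-memory interface described above can be sketched as follows. This is a toy illustration, not the paper's implementation: all names are hypothetical, and a simple exponential moving average stands in for the transformer-based memory update. The point it demonstrates is the key property claimed in the abstract: the memory stays a fixed size no matter how long the observation stream grows, so per-step cost is constant, unlike attention over a growing history.

```python
# Toy sketch of a Kinaema-style agent loop (hypothetical names).
# A fixed-size latent memory is updated recurrently per observation;
# the real model would use a transformer update and a learned pose head.

MEM_SIZE = 8  # fixed latent memory dimensionality (illustrative choice)

def update_memory(memory, observation, alpha=0.9):
    """Recurrently fold one observation into the fixed-size latent memory.

    Stand-in for the transformer update: an exponential moving average.
    The memory size never grows, regardless of stream length.
    """
    return [alpha * m + (1.0 - alpha) * o for m, o in zip(memory, observation)]

def predict_relative_pose(memory, query):
    """Stand-in for the pose head: compare query features to the memory.

    The real model would output a 6-DoF relative pose; here we just
    return a feature difference of the same fixed size.
    """
    return [q - m for q, m in zip(query, memory)]

# Integrate a stream of observations of arbitrary length.
memory = [0.0] * MEM_SIZE
stream = [[float(t + i) for i in range(MEM_SIZE)] for t in range(100)]
for obs in stream:
    memory = update_memory(memory, obs)  # O(1) state, no stored history

# Upon request, process a query image's features against the memory.
query = [1.0] * MEM_SIZE
pose = predict_relative_pose(memory, query)
print(len(memory), len(pose))
```

The design point is that `memory` is the agent's only persistent state: there is no list of past frames to attend over, which is what removes the hard context-length constraint.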