Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

📅 2026-05-26
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses a limitation of conventional Transformers in long-sequence video understanding: they lack an explicit, persistent spatial memory, which makes reasoning under occlusion and over long temporal horizons difficult. The authors propose a lightweight module that embeds a fixed-size recurrent 3D memory tensor in Transformer blocks. A differentiable soft write and gated recurrent dynamics decouple memory capacity from input sequence length while preserving a spatial inductive bias. The key components are Gaussian-weighted voxel writing, a local interaction operator, continuous sampling-based reading, and gated residual fusion. The module is evaluated on language, image, and video benchmarks and on a controlled toy diagnostic suite, and it integrates with standard Transformer training pipelines.
📝 Abstract
Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
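The write, update, and read steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 3x3x3 box filter standing in for the local interaction operator, trilinear interpolation as the continuous sampling scheme, the sigmoid gate forms, and the weight matrices w_gate and w_fuse are all assumptions made for illustration. In the paper the 3D locations are predicted by the model; here they are passed in directly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_write(grid_shape, positions, values, sigma=1.0):
    """Deposit each token as a Gaussian-weighted volume around its 3D position."""
    D, H, W = grid_shape
    # Voxel-centre coordinates, shape (D, H, W, 3).
    grid = np.stack(
        np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij"),
        axis=-1,
    ).astype(np.float64)
    diff = grid[None] - positions[:, None, None, None, :]      # (N, D, H, W, 3)
    weights = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    # Normalise so every token deposits a total weight of 1 (an assumption).
    weights /= weights.sum(axis=(1, 2, 3), keepdims=True)
    return np.einsum("ndhw,nc->dhwc", weights, values)          # (D, H, W, C)

def local_mix(x):
    """Stand-in local interaction: edge-padded 3x3x3 box average."""
    D, H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for dz in range(3):
        for dy in range(3):
            for dx in range(3):
                out += padded[dz:dz + D, dy:dy + H, dx:dx + W]
    return out / 27.0

def trilinear_read(memory, positions):
    """Sample the memory at continuous positions (each grid dim must be >= 2)."""
    D, H, W, C = memory.shape
    upper = np.array([D - 1, H - 1, W - 1], dtype=np.float64)
    p = np.clip(positions, 0.0, upper)
    lo = np.minimum(np.floor(p), upper - 1).astype(int)
    frac = p - lo
    out = np.zeros((positions.shape[0], C))
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                w = (
                    np.where(dz, frac[:, 0], 1 - frac[:, 0])
                    * np.where(dy, frac[:, 1], 1 - frac[:, 1])
                    * np.where(dx, frac[:, 2], 1 - frac[:, 2])
                )
                out += w[:, None] * memory[lo[:, 0] + dz, lo[:, 1] + dy, lo[:, 2] + dx]
    return out

def tensor_memory_step(memory, tokens, positions, w_gate, w_fuse, sigma=1.0):
    """One write / update / read cycle; memory keeps its shape for any token count."""
    written = soft_write(memory.shape[:3], positions, tokens, sigma)
    mixed = local_mix(memory + written)
    gate = sigmoid(mixed @ w_gate)                       # hypothetical update gate
    memory = (1.0 - gate) * memory + gate * np.tanh(mixed)
    read = trilinear_read(memory, positions)
    fused = tokens + sigmoid(tokens @ w_fuse) * read     # gated residual fusion
    return memory, fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C = 8
    memory = np.zeros((6, 6, 6, C))
    w_gate, w_fuse = rng.normal(size=(C, C)), rng.normal(size=(C, C))
    for n_tokens in (16, 256):  # the state size does not depend on n_tokens
        tokens = rng.normal(size=(n_tokens, C))
        positions = rng.uniform(0, 5, size=(n_tokens, 3))
        memory, fused = tensor_memory_step(memory, tokens, positions, w_gate, w_fuse)
        print(n_tokens, memory.shape, fused.shape)
```

The point of the sketch is the shape bookkeeping: the memory stays at (D, H, W, C) however many tokens are written, whereas a KV cache grows by one entry per token. The dense per-token Gaussian over the whole grid is chosen for clarity; restricting the write to a small window around each position would be a natural efficiency choice, but that is also an assumption here.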
Problem

Research questions and friction points this paper is trying to address.

long-horizon video understanding
persistent spatial state
occlusion-sensitive reasoning
memory scalability
spatial inductive bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor Memory
fixed-size recurrent state
3D memory tensor
soft write
spatial inductive bias