🤖 AI Summary
This work addresses long-term visual-spatial understanding in unbounded video streams by proposing a streaming spatial intelligence framework. The approach integrates test-time training (TTT), dynamically adapting a subset of parameters (fast weights) to the continuous input stream; it pairs sliding-window attention with a hybrid neural architecture for efficient long-sequence processing, and incorporates 3D spatio-temporal convolutions to strengthen geometric correspondence and temporal coherence modeling. The authors also construct a video dataset with dense 3D spatial annotations, designed to support structured learning of global spatial signals. Evaluated on multiple video spatial understanding benchmarks, the method achieves state-of-the-art performance, substantially improving spatial perception and memory in long-duration scenarios.
📝 Abstract
Humans perceive and understand real-world spaces through a stream of visual observations. The ability to maintain and update spatial evidence in a streaming fashion from potentially unbounded video is therefore essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT, a step toward streaming vision-based spatial intelligence built on test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates in parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to the TTT layers via 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond the architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
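To make the core idea concrete, here is a minimal toy sketch of chunk-wise test-time training with fast weights. It is not the paper's model: the per-frame features, the next-frame prediction loss, and the single-gradient-step-per-chunk update are all our simplifying assumptions, standing in for the large-chunk TTT updates and spatial-predictive objective described above.

```python
import numpy as np

def ttt_fast_weight_stream(frames, chunk_size=8, lr=0.5):
    """Toy chunk-wise test-time-training loop (illustrative sketch only).

    frames: (T, d) array of per-frame features. A fast-weight matrix W
    is adapted online with a self-supervised next-frame prediction loss,
    one gradient step per chunk -- a stand-in for large-chunk TTT
    updates. Returns per-chunk prediction errors and the final weights.
    """
    T, d = frames.shape
    W = np.zeros((d, d))                 # fast weights, adapted at test time
    chunk_errors = []
    for start in range(0, T - 1, chunk_size):
        stop = min(start + chunk_size, T - 1)
        x = frames[start:stop]           # current frames in this chunk
        y = frames[start + 1:stop + 1]   # next-frame targets
        pred = x @ W                     # predict the next frame's features
        err = pred - y
        chunk_errors.append(float(np.mean(err ** 2)))
        W -= lr * x.T @ err / len(x)     # one large-chunk gradient step
    return chunk_errors, W

# Usage: features evolving under a fixed (hypothetical) linear dynamic,
# so the fast weights can absorb the "scene dynamics" as the stream runs.
rng = np.random.default_rng(0)
d = 4
A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # fixed rotation dynamics
frames = np.empty((200, d))
frames[0] = rng.standard_normal(d)
frames[0] /= np.linalg.norm(frames[0])
for t in range(1, 200):
    frames[t] = frames[t - 1] @ A
errors, W = ttt_fast_weight_stream(frames)
# prediction error shrinks as the fast weights adapt to the stream
```

In the full method this adaptation runs alongside sliding-window attention, so the fast weights carry long-horizon spatial state while attention handles the local window.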