STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sequential 3D reconstruction methods face two key bottlenecks: prohibitively expensive global optimization and memory mechanisms that scale poorly to long sequences. This paper introduces STream3R—the first streaming 3D reconstruction framework built upon a causal, decoder-only Transformer—formulating point cloud prediction as an autoregressive sequence generation task. By leveraging causal self-attention, STream3R implicitly encodes geometric priors and enables efficient online processing of both static and dynamic scenes. The architecture supports large-scale pretraining and fine-tuning in an LLM-style paradigm, substantially enhancing long-sequence modeling capability and real-time performance. Evaluated on multiple static and dynamic benchmarks, STream3R achieves superior reconstruction accuracy with significantly lower computational overhead compared to prior art. It is the first work to empirically validate the effectiveness and scalability of causal Transformers for online 3D perception.
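The summary above describes the core mechanism: each incoming frame's tokens attend only to past frames, so outputs for earlier frames never change as the stream grows. A minimal sketch of that causal-masking property is below; the names (`causal_attention`, the toy feature dimensions) are illustrative, not from the paper, and a single NumPy attention head stands in for the full decoder-only Transformer.

```python
# Minimal sketch of causal self-attention over per-frame tokens, illustrating
# why a decoder-only model can reconstruct a stream frame by frame.
# Function and variable names are illustrative, not from STream3R itself.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(tokens):
    """Single-head self-attention with a causal mask.

    tokens: (T, D) array, one token per frame. The output at step t attends
    only to frames 0..t, so appending new frames never changes past outputs.
    """
    T, D = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(D)           # (T, T) pairwise similarity
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal = future
    scores[mask] = -np.inf                            # block attention to future frames
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))      # 5 frames, 8-dim toy features

out3 = causal_attention(frames[:3])   # process the first 3 frames online
out5 = causal_attention(frames)       # later, the full 5-frame stream
# Past outputs are identical: the causal mask is what makes processing streamable.
assert np.allclose(out3, out5[:3])
```

This is the property the paper exploits: because nothing in the past depends on the future, the model can emit a pointmap per frame as the stream arrives, instead of re-running a global optimization over all views.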

📝 Abstract
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
Problem

Research questions and friction points this paper is trying to address.

Efficient 3D reconstruction from image sequences
Overcoming poor scalability in multi-view reconstruction
Handling dynamic scenes with causal Transformer models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only Transformer for 3D reconstruction
Causal attention for efficient sequence processing
LLM-style training for scalable 3D tasks
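The efficiency claim behind the last two bullets is the same one LLM inference relies on: with causal attention, keys and values of past frames can be cached, so each new frame costs O(T) attention work instead of recomputing the full O(T²) sequence. A hedged sketch of that key-value caching idea, under assumed toy dimensions (the class name `StreamingAttention` is hypothetical, not the paper's API):

```python
# Sketch of KV-cache-style incremental attention: each new frame attends to
# cached history, matching batch causal attention without reprocessing the
# whole sequence. Names and dimensions are illustrative, not from STream3R.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class StreamingAttention:
    """Caches tokens of past frames; one step costs O(T), not O(T^2)."""
    def __init__(self, dim):
        self.dim = dim
        self.cache = []  # tokens of all frames seen so far

    def step(self, token):
        """Process one incoming frame token of shape (dim,)."""
        self.cache.append(token)
        kv = np.stack(self.cache)                 # (t, D) history incl. this frame
        scores = kv @ token / np.sqrt(self.dim)   # attend new frame to history
        return softmax(scores) @ kv               # output for this frame only

rng = np.random.default_rng(1)
frames = rng.normal(size=(4, 8))
attn = StreamingAttention(8)
outs = np.stack([attn.step(f) for f in frames])   # online, one frame at a time
```

Per frame, the streaming loop produces exactly what a batch causal-attention pass over the whole sequence would, which is why the architecture maps cleanly onto existing LLM training and serving infrastructure.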