HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

📅 2026-05-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Online 3D reconstruction often suffers from drift, jitter, or structural collapse on long sequences because existing architectures do not model the temporal heterogeneity of geometric evidence. This work formalizes geometric propagation as an evidence influence kernel and proposes HorizonStream, a streaming Transformer that explicitly factorizes this kernel under causal constraints. It combines Geometric Linear Attention for bounded, multi-timescale long-range propagation, Geometric Local Attention with Spatiotemporal RoPE for short-range 3D matching, and Metric Readout Tokens for recovering scale and pose. Trained on only 48-frame clips, the model generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, and the authors report state-of-the-art streaming 3D reconstruction performance.
📝 Abstract
Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
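To make the idea of bounded, multi-timescale propagation more concrete, here is a minimal sketch of causal linear attention with a per-channel decay. It is not the paper's implementation: the function names `decayed_linear_attention` and `influence_weight` are hypothetical, the decay rates are fixed constants rather than learned, and the exact form of Geometric Linear Attention is not given in the abstract. The sketch only illustrates why such a recurrence needs constant memory and linear time, and how each channel's decay determines how long past evidence keeps its influence.

```python
def decayed_linear_attention(queries, keys, values, decay):
    """Causal linear attention with one decay rate per key channel.

    queries, keys: lists of T vectors of length d_k
    values:        list of T vectors of length d_v
    decay:         list of d_k rates in (0, 1); assumed fixed here, not learned
    Returns the T output vectors and the final d_k x d_v state.
    """
    d_k, d_v = len(decay), len(values[0])
    # The state has a fixed size, so memory does not grow with sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):  # one pass: linear time in T
        for c in range(d_k):
            for j in range(d_v):
                # Each channel forgets at its own rate, then absorbs new evidence.
                state[c][j] = decay[c] * state[c][j] + k[c] * v[j]
        outputs.append(
            [sum(q[c] * state[c][j] for c in range(d_k)) for j in range(d_v)]
        )
    return outputs, state


def influence_weight(q_t, k_s, decay, lag):
    """Weight that frame s contributes to frame t = s + lag in the recurrence above."""
    return sum(qc * (a ** lag) * kc for qc, kc, a in zip(q_t, k_s, decay))


# Deterministic toy data: 50 frames, 3 key channels, 2 value channels.
T, d_k, d_v = 50, 3, 2
queries = [[((7 * t + 3 * c) % 5 - 2) / 2 for c in range(d_k)] for t in range(T)]
keys = [[((5 * t + 2 * c) % 7 - 3) / 3 for c in range(d_k)] for t in range(T)]
values = [[((3 * t + j) % 4 - 1.5) for j in range(d_v)] for t in range(T)]
decay = [0.5, 0.9, 0.99]  # fast, medium, and slow channels

outputs, state = decayed_linear_attention(queries, keys, values, decay)
```

Unrolling the recurrence shows that the output at frame t is a sum over all earlier frames weighted by `influence_weight`, so a channel with decay near 1 carries long-lived evidence while a channel with small decay carries only recent evidence. Because every rate is below 1, each state entry is bounded by a geometric series, which is one way an accumulating state can avoid saturating on long sequences.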
Problem

Research questions and friction points this paper is trying to address.

streaming 3D reconstruction
long-horizon attention
temporal heterogeneity
camera pose estimation
scene geometry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-Horizon Attention
Streaming 3D Reconstruction
Geometric Linear Attention
Spatiotemporal RoPE
Evidence Influence Kernel