🤖 AI Summary
Existing streaming multi-view reconstruction methods struggle to balance long-sequence processing with real-time performance: offline optimization is computationally prohibitive, while lightweight approaches are inherently limited to short sequences. This paper introduces the first real-time 3D reconstruction framework capable of handling arbitrarily long streaming multi-view inputs. Our core contributions are threefold: (1) a 3D spatiotemporal memory mechanism that enables efficient state updates via memory gating and dynamic pruning; (2) a dual-source fine-grained decoder that jointly leverages geometric and appearance cues; and (3) a two-stage curriculum learning strategy coupled with adaptive resolution adjustment to jointly mitigate redundancy and enhance reconstruction fidelity. Evaluated on long-sequence benchmarks, our method significantly outperforms prior streaming approaches—achieving state-of-the-art reconstruction quality while maintaining real-time inference speed (>15 FPS).
📝 Abstract
Recent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model's performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed. Project page: https://zgchen33.github.io/LONG3R/.