LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing feed-forward 4D reconstruction models (e.g., VGGT, π³) suffer from O(N²) memory complexity, which blocks streaming inference on long videos; streaming alternatives, meanwhile, require extensive retraining and cannot exploit the strong geometric priors embedded in offline models. This paper proposes a training-free streaming adaptation framework built around the first inter-layer scale alignment mechanism: it segments predicted depth into discrete layers on top of a Sim(3) alignment, computes per-layer scale factors, and propagates them across sliding windows and timestamps, preserving temporal geometric consistency while suppressing the inter-layer scale drift induced by monocular depth ambiguity. On kilometer-scale videos the method runs at 14 FPS with only 6 GB of peak GPU memory while matching state-of-the-art offline methods in pose and point-cloud accuracy, enabling low-latency, zero-retraining, high-fidelity 4D reconstruction in streaming deployment.

📝 Abstract
Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes the relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER matches the camera pose estimation and point map reconstruction quality of state-of-the-art offline models while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: https://neu-vi.github.io/LASER/
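The layer-wise scale alignment described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the quantile-based layer boundaries, the layer count, the minimum-pixel threshold, and the median-ratio estimator are all illustrative choices, and depths are passed as flat lists for one overlapping frame shared by two consecutive windows.

```python
import statistics

def layerwise_scale_align(depth_prev, depth_curr, num_layers=4):
    """Align depth_curr to depth_prev's scale with one factor per depth layer.

    depth_prev, depth_curr: flat lists of per-pixel depths predicted for the
    same overlapping frame by the previous and current windows.
    Returns (aligned_depths, per_layer_scales).
    """
    # Quantile-based layer boundaries taken from the previous window's depths.
    sorted_prev = sorted(depth_prev)
    n = len(sorted_prev)
    edges = [sorted_prev[min(round(k * (n - 1) / num_layers), n - 1)]
             for k in range(num_layers + 1)]
    aligned = list(depth_curr)
    scales = [1.0] * num_layers
    for k in range(num_layers):
        lo, hi = edges[k], edges[k + 1]
        idx = [i for i, d in enumerate(depth_prev) if lo <= d <= hi]
        if len(idx) < 10:          # too few pixels in this layer: keep unit scale
            continue
        # Robust per-layer scale: median ratio of overlapping depths.
        scales[k] = statistics.median(depth_prev[i] / depth_curr[i] for i in idx)
        for i in idx:
            aligned[i] = depth_curr[i] * scales[k]
    return aligned, scales
```

A single global ratio would be the Sim(3)-style alignment the abstract says fails; estimating one factor per depth layer is what lets near and far scene content drift by different amounts between windows and still be corrected.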
Problem

Research questions and friction points this paper is trying to address.

Enables streaming 4D reconstruction from videos without retraining offline models.
Addresses layer depth misalignment across temporal windows in streaming reconstruction.
Reduces memory usage for kilometer-scale video processing to practical levels.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts offline models to streaming without retraining.
Aligns predictions across windows via layer-wise scaling.
Segments depth into layers for consistent scale propagation.
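The sliding-window conversion these points describe can be sketched as a driver loop. Everything here is hypothetical: `model` stands in for any offline feed-forward reconstructor returning one flat depth list per input frame, the window size and one-frame overlap are illustrative, and for brevity this sketch propagates a single global scale per window where LASER applies one factor per depth layer.

```python
def stream_reconstruct(frames, model, window=4):
    """Run an offline per-window model over a stream, propagating scale."""
    outputs = []                       # scale-consistent depth lists so far
    start = 0
    while start < len(frames):
        chunk = frames[start:start + window]
        preds = model(chunk)           # one depth list per frame in the chunk
        scale = 1.0
        if outputs:
            # The first frame of this window repeats the last frame of the
            # previous one; matching their mean depths gives the window scale.
            prev, curr = outputs[-1], preds[0]
            scale = (sum(prev) / len(prev)) / (sum(curr) / len(curr))
        aligned = [[d * scale for d in f] for f in preds]
        # Drop the duplicated overlap frame before appending.
        outputs.extend(aligned if not outputs else aligned[1:])
        start += window - 1            # slide forward with one-frame overlap
    return outputs
```

Because only one window of frames is resident at a time, peak memory stays bounded by the window size rather than growing quadratically with video length, which is the property that makes kilometer-scale streams practical.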