StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a feedforward dynamic street scene reconstruction method that operates without camera poses or object trackers, targeting efficient closed-loop simulation and downstream tasks in autonomous driving. By integrating temporal mask attention and a causal feedforward architecture, the model explicitly captures motion cues from monocular image sequences and unifies static and dynamic scene content within a 3D Gaussian splatting representation. Joint optimization leveraging cross-frame rendering and spatiotemporal consistency constraints enables high-fidelity synthesis of novel views at arbitrary future timesteps. Evaluated on the Waymo Open Dataset, the approach significantly outperforms existing methods in novel view synthesis and depth estimation, while demonstrating strong zero-shot generalization on datasets such as CARLA.
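The causal temporal attention described above can be sketched minimally: each frame's tokens attend only to tokens from the same or earlier frames, so the model can process a stream feedforward without future information. This is a hypothetical illustration of the general mechanism, not the paper's actual module; the function name, single-head formulation, and token layout are assumptions.

```python
import torch

def temporal_causal_attention(tokens, num_frames):
    """Hypothetical sketch of causal attention over a frame sequence.

    tokens: (B, T*N, C) — T frames, each contributing N tokens.
    Each query token may only attend to keys from its own frame or
    earlier frames, mimicking a causal feedforward architecture.
    """
    B, L, C = tokens.shape
    N = L // num_frames
    # Frame index of every token: [0,0,...,1,1,...,T-1,...]
    frame_id = torch.arange(num_frames).repeat_interleave(N)
    # allowed[i, j] = True where query i may attend to key j
    allowed = frame_id[:, None] >= frame_id[None, :]

    q = k = v = tokens  # single-head self-attention, no projections (sketch)
    attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)
    attn = attn.masked_fill(~allowed, float("-inf"))
    attn = attn.softmax(dim=-1)
    return attn @ v
```

Because the mask is strictly frame-causal, perturbing a later frame's tokens leaves the outputs for earlier frames unchanged, which is the property that lets such a model run online over a driving sequence.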

📝 Abstract
Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction. Building upon the alternating attention mechanism from Visual Geometry Grounded Transformer (VGGT), we propose a simple yet effective temporal mask attention module that captures dynamic motion information from image sequences and produces motion-aware latent representations. Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting, and are optimized jointly by cross-frame rendering with spatio-temporal consistency, allowing the model to infer per-pixel velocities and produce high-fidelity novel views at new poses and times. We train and evaluate our model on the Waymo Open Dataset, demonstrating superior performance on novel view synthesis and depth estimation compared to existing methods. Furthermore, zero-shot inference on CARLA and other datasets validates the generalization capability of our approach. More visualizations are available on our project page: https://streetforward.github.io.
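The abstract's idea of inferring per-pixel velocities and rendering at new times can be illustrated with a minimal sketch: if each Gaussian carries a predicted velocity, its center can be linearly advected to an arbitrary future timestep before splatting, with near-zero-velocity Gaussians treated as static background. The function name, linear motion model, and threshold are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def advance_gaussians(means, velocities, dt, static_thresh=1e-3):
    """Hypothetical sketch: advect dynamic 3D Gaussian centers to a new time.

    means:      (N, 3) Gaussian centers at the reference time.
    velocities: (N, 3) predicted per-Gaussian velocities.
    dt:         time offset of the target render timestep.

    Gaussians whose speed falls below `static_thresh` are treated as
    static scene content and left in place; the rest move linearly.
    """
    speed = np.linalg.norm(velocities, axis=-1, keepdims=True)
    dynamic = speed > static_thresh
    return np.where(dynamic, means + velocities * dt, means)
```

A renderer would then splat the advected centers from the target camera pose, giving novel views at both new poses and new times, as the abstract describes.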
Problem

Research questions and friction points this paper is trying to address.

dynamic street reconstruction
feedforward reconstruction
autonomous driving
novel view synthesis
3D scene representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

feedforward reconstruction
temporal mask attention
3D Gaussian Splatting
dynamic scene modeling
pose-free novel view synthesis