Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

📅 2026-03-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work tackles joint dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras, a setting that prior methods do not cover because they either take single-camera input or rely on rigidly mounted, pre-calibrated camera rigs. The authors propose a two-stage optimization framework. The first stage extends single-camera visual SLAM to multiple cameras by building a spatiotemporal connection graph that combines intra-camera temporal continuity with inter-camera spatial overlap, which yields consistent scale and robust tracking. A wide-baseline initialization strategy based on feed-forward reconstruction models keeps tracking robust when the cameras' views overlap only slightly. The second stage refines dense depth and camera poses by optimizing inter- and intra-camera consistency with wide-baseline optical flow. The paper also introduces MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Experiments on synthetic and real-world benchmarks show the method significantly outperforming state-of-the-art feed-forward models while requiring less memory.
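
To make the graph idea concrete, here is a minimal Python sketch of a spatiotemporal connection graph. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_spatiotemporal_graph`, the `temporal_radius` parameter, and the `overlap` callback are all hypothetical, and the cameras are assumed to be time-synchronized so that equal frame indices refer to the same instant.

```python
from itertools import combinations


def build_spatiotemporal_graph(num_cameras, num_frames, temporal_radius=1, overlap=None):
    """Return a set of undirected edges between (camera, frame) nodes.

    overlap is a hypothetical callable (cam_a, cam_b, frame) -> bool that
    stands in for a real view-overlap test. When it is None, every camera
    pair is assumed to overlap at every frame.
    """
    edges = set()

    # Intra-camera temporal edges: link each frame to its near successors.
    for cam in range(num_cameras):
        for t in range(num_frames):
            for dt in range(1, temporal_radius + 1):
                if t + dt < num_frames:
                    edges.add(((cam, t), (cam, t + dt)))

    # Inter-camera spatial edges: link different cameras at the same frame
    # index (assumes synchronized capture) when their views overlap.
    for t in range(num_frames):
        for cam_a, cam_b in combinations(range(num_cameras), 2):
            if overlap is None or overlap(cam_a, cam_b, t):
                edges.add(((cam_a, t), (cam_b, t)))

    return edges


if __name__ == "__main__":
    graph = build_spatiotemporal_graph(num_cameras=2, num_frames=3)
    for edge in sorted(graph):
        print(edge)
```

In a SLAM-style pipeline, each edge would typically carry a pairwise constraint between the two frames it connects. The inter-camera edges are what tie the per-camera trajectories into one shared frame and scale.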

📝 Abstract

We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras, a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.
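
The second-stage idea of tying depth and pose to optical flow can be illustrated with a generic flow-consistency residual. The sketch below is a standard pinhole-camera formulation, not the paper's actual objective. The names `induced_flow` and `flow_residual` are hypothetical, a single shared intrinsic matrix `K` is assumed, and scene points are treated as static between the two views, which moving objects in a dynamic scene would violate.

```python
import numpy as np


def induced_flow(depth, K, R, t):
    """Flow predicted in view i from its depth map and a relative pose.

    (R, t) maps points from camera-i coordinates to camera-j coordinates.
    Assumes a static scene and shared pinhole intrinsics K (illustrative).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (h, w, 3) homogeneous pixels

    rays = pix @ np.linalg.inv(K).T                    # back-project to unit-depth rays
    pts_i = rays * depth[..., None]                    # 3D points in camera i
    pts_j = pts_i @ R.T + t                            # same points in camera j
    proj = pts_j @ K.T                                 # project into view j
    uv_j = proj[..., :2] / proj[..., 2:3]

    return uv_j - pix[..., :2]                         # per-pixel displacement


def flow_residual(depth, K, R, t, observed_flow):
    """Difference between geometry-induced flow and observed optical flow."""
    return induced_flow(depth, K, R, t) - observed_flow


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
    depth = np.full((4, 4), 2.0)
    flow = induced_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
    print(flow[0, 0])
```

An optimizer would adjust depth and pose to shrink this residual over many frame pairs. The same residual applies whether the two views come from one camera at different times or from two different cameras, which is where wide-baseline flow matters.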

Problem

Research questions and friction points this paper is trying to address.

dense dynamic scene reconstruction

camera pose estimation

multi-view videos

freely moving cameras

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view dynamic reconstruction

camera pose estimation

spatiotemporal graph

wide-baseline optical flow