POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses geometric inconsistency in dynamic-scene 3D reconstruction caused by ambiguous matching in dynamic regions. It proposes a unified framework that jointly models pointmap matching and explicit temporal motion. Built upon DUSt3R's pointmap representation, the method integrates multi-view geometric mapping, differentiable cross-view RGB-to-3D pointmap matching, a temporal motion module, and scale-consistent optimization, all co-optimized within a shared 3D coordinate system to refine geometry and correspondences simultaneously. Compared with prior approaches, it significantly mitigates matching ambiguity in dynamic regions, yielding substantial improvements in video depth estimation, 3D point tracking, and camera pose estimation, and achieves state-of-the-art results across multiple benchmarks. Code and pretrained models are publicly available.

📝 Abstract
3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules, where the latter is pivotal for distinguishing dynamic regions and helps mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the pointmap representation proposed in DUSt3R suggested a potential way to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction that marries pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.
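To make the pointmap-matching idea concrete: once two views are expressed as pointmaps (per-pixel 3D points, H×W×3) in a shared coordinate system, a correspondence between a pixel in view A and a pixel in view B can in principle be read off from 3D proximity. The sketch below illustrates that principle only, with a brute-force nearest-neighbour search; POMATO instead *learns* the matching with a differentiable network, and the function name, threshold, and array shapes here are illustrative assumptions, not the paper's API.

```python
import numpy as np

def match_pointmaps(pm_a, pm_b, max_dist=0.05):
    """Illustrative matcher: for each pixel of view A, find the pixel of
    view B whose 3D point (in the shared coordinate frame) is nearest.

    pm_a, pm_b: (H, W, 3) pointmaps expressed in the same coordinate system.
    Returns an (M, 4) array of matches [row_a, col_a, row_b, col_b] whose
    3D distance is below max_dist (in scene units; value is a placeholder).
    """
    ha, wa, _ = pm_a.shape
    hb, wb, _ = pm_b.shape
    pts_a = pm_a.reshape(-1, 3)
    pts_b = pm_b.reshape(-1, 3)

    # Brute-force squared distances; fine for tiny maps, use a KD-tree
    # (e.g. scipy.spatial.cKDTree) for realistic resolutions.
    d2 = ((pts_a[:, None, :] - pts_b[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)
    ok = np.sqrt(d2[np.arange(len(pts_a)), nn]) <= max_dist

    rows_a, cols_a = np.divmod(np.arange(ha * wa), wa)
    rows_b, cols_b = np.divmod(nn, wb)
    matches = np.stack([rows_a, cols_a, rows_b, cols_b], axis=1)
    return matches[ok]
```

This hand-crafted search is exactly what becomes ambiguous in dynamic regions (a moving point has no single consistent 3D location across frames), which is the failure mode the paper's learned matching and temporal motion module are designed to address.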
Problem

Research questions and friction points this paper is trying to address.

Unify geometry estimation and matching in dynamic 3D reconstruction
Improve ambiguous matching in dynamic regions for better accuracy
Ensure scale consistency and enhance 3D point tracking performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies geometry estimation and matching in 3D space
Learns explicit matching via RGB to 3D pointmaps
Introduces temporal motion for scale consistency
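On the scale-consistency point above: pairwise pointmap predictions are generally defined only up to a per-pair scale, so frames must be brought to a common scale before motion can be reasoned about over time. The paper handles this inside its temporal motion module; as a minimal stand-in, the sketch below shows one classical way to align two frames' scales with a robust median-depth ratio. The function name, the use of the z-channel as depth, and the optional static-region mask are assumptions for illustration, not the paper's method.

```python
import numpy as np

def align_scale(pm_ref, pm_tgt, mask=None):
    """Rescale pm_tgt so its depth statistics match pm_ref.

    pm_ref, pm_tgt: (H, W, 3) pointmaps; the z-channel is treated as depth.
    mask: optional boolean (H, W) mask selecting pixels assumed static,
          so moving objects do not bias the scale estimate.
    Returns (rescaled pointmap, scale factor).
    """
    z_ref = pm_ref[..., 2]
    z_tgt = pm_tgt[..., 2]
    if mask is not None:
        z_ref, z_tgt = z_ref[mask], z_tgt[mask]
    # Median ratio is robust to outliers from mismatched or dynamic pixels.
    s = np.median(z_ref) / np.median(z_tgt)
    return pm_tgt * s, s
```

With frames on a common scale, per-pixel displacement between consecutive aligned pointmaps becomes a meaningful 3D motion signal, which is what makes downstream 3D point tracking tractable.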