🤖 AI Summary
Monocular video depth estimation suffers from geometric ambiguity and a lack of reliable depth cues. To address this, we propose an unsupervised zero-shot method for relative depth estimation that models the spatiotemporal evolution of tracked point trajectories. Inspired by human visual perception, our approach explicitly encodes temporal variations in point size and inter-point spacing. Crucially, this work is the first to introduce spatiotemporal trajectory modeling into zero-shot depth estimation, eliminating reliance on stereo matching or ground-truth depth supervision. We employ an off-the-shelf 2D point tracker to extract trajectories and design a dual-branch Transformer, with spatial and temporal branches that jointly learn trajectory representations. Evaluated on the TAPVid-3D benchmark, our method achieves state-of-the-art zero-shot performance, yielding temporally smooth, high-fidelity depth predictions with strong cross-domain (synthetic-to-real) generalization.
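The inter-point-spacing cue the summary describes can be illustrated with a toy computation (not the authors' code; the function name and the idealized pinhole setup are assumptions): for a roughly rigid set of tracked points, projected spacing scales inversely with depth, so spacing that grows over time signals an object approaching the camera.

```python
import numpy as np

def spacing_cue(tracks):
    """Mean pairwise 2D distance among tracked points, per frame.
    tracks: (T, N, 2) array of point positions over T frames."""
    diffs = tracks[:, :, None, :] - tracks[:, None, :, :]  # (T, N, N, 2)
    dists = np.linalg.norm(diffs, axis=-1)                 # (T, N, N)
    n = tracks.shape[1]
    off_diag = ~np.eye(n, dtype=bool)                      # drop self-distances
    return dists[:, off_diag].mean(axis=1)                 # (T,)

# Toy example: a rigid 3-point cluster moving toward the camera.
# Under an ideal pinhole projection, image coordinates scale as 1/z,
# so halving the depth doubles the projected spacing.
base = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
depths = np.array([4.0, 2.0, 1.0])               # camera distance per frame
tracks = np.stack([base / z for z in depths])    # (T=3, N=3, 2)
cue = spacing_cue(tracks)
print(cue)  # spacing grows monotonically as depth shrinks
```

This is the monocular analogue of "looming": absolute depth stays ambiguous, but the ratio of spacing across frames constrains relative depth change, which is exactly the kind of signal a learned trajectory model can exploit.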
📝 Abstract
Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
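The spatial and temporal transformers described above could plausibly be arranged as follows. This is a minimal sketch, not the authors' implementation: the module name, dimensions, shared embedding, and concatenation-based fusion are all assumptions; the only grounded structure is one branch attending across points within a frame and one attending across frames along a track, with a head predicting per-point, per-frame relative depth.

```python
import torch
import torch.nn as nn

class DualBranchDepthHead(nn.Module):
    """Hypothetical dual-branch Transformer over 2D point trajectories."""
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(2, d_model)  # lift (x, y) track coordinates
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True)
        self.spatial = nn.TransformerEncoder(make_layer(), num_layers)   # across points
        self.temporal = nn.TransformerEncoder(make_layer(), num_layers)  # across frames
        self.head = nn.Linear(2 * d_model, 1)  # per-point, per-frame relative depth

    def forward(self, tracks):  # tracks: (B, N, T, 2)
        B, N, T, _ = tracks.shape
        x = self.embed(tracks)  # (B, N, T, D)
        # Spatial branch: for each frame, self-attention over the N points.
        s = self.spatial(x.permute(0, 2, 1, 3).reshape(B * T, N, -1))
        s = s.reshape(B, T, N, -1).permute(0, 2, 1, 3)
        # Temporal branch: for each point, self-attention over the T frames.
        t = self.temporal(x.reshape(B * N, T, -1)).reshape(B, N, T, -1)
        # Fuse both views and regress relative depth per point per frame.
        return self.head(torch.cat([s, t], dim=-1)).squeeze(-1)  # (B, N, T)

model = DualBranchDepthHead()
tracks = torch.randn(1, 8, 16, 2)  # 8 tracked points over 16 frames
depth = model(tracks)
print(depth.shape)  # torch.Size([1, 8, 16])
```

Factoring attention into per-frame (spatial) and per-track (temporal) passes keeps cost linear in N·T per branch rather than quadratic in the full N·T token set, which is a common design choice for trajectory-shaped inputs.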