FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

📅 2024-03-19
🏛️ European Conference on Computer Vision
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address insufficient motion modeling and underuse of multi-frame information in video depth estimation, this paper proposes FutureDepth, a prediction-driven spatiotemporal representation learning framework. Its core contribution is the combination of a future prediction network (F-Net) and a reconstruction network (R-Net). F-Net is trained to predict multi-frame features one time step ahead iteratively, and R-Net is trained via adaptively masked auto-encoding of multi-frame feature volumes, so that both implicitly learn inter-frame motion and correspondence. At inference time, both networks produce queries for the depth decoder and a final refinement network. Evaluated on NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, FutureDepth sets new state-of-the-art accuracy and outperforms existing video depth estimation methods. It is also more efficient than existing state-of-the-art video depth estimation models, with latency similar to that of monocular models.
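The iterative one-step-ahead prediction attributed to F-Net can be illustrated with a toy rollout. This is only a minimal sketch under stated assumptions, not the paper's implementation: frame features are plain lists of floats, `linear_step` is a hypothetical stand-in for the learned predictor (it simply extrapolates linearly), and the mean-squared-error objective is an assumed choice that the summary does not specify.

```python
from typing import Callable, List, Sequence

Feature = List[float]   # toy stand-in for one frame's feature map
Window = List[Feature]  # features of several consecutive frames


def linear_step(window: Sequence[Feature]) -> Window:
    """Hypothetical predictor: shift the window one time step ahead.

    The last T-1 frames are copied, and the new final frame is linearly
    extrapolated from the two most recent frames. A real F-Net would be
    a learned network; this is an assumption for illustration only.
    """
    last, prev = window[-1], window[-2]
    nxt = [2.0 * a - b for a, b in zip(last, prev)]
    return [list(f) for f in window[1:]] + [nxt]


def rollout(window: Sequence[Feature],
            step: Callable[[Sequence[Feature]], Window],
            n_steps: int) -> List[Window]:
    """Apply the one-step predictor iteratively, feeding predictions back in."""
    preds: List[Window] = []
    current: Window = [list(f) for f in window]
    for _ in range(n_steps):
        current = step(current)
        preds.append(current)
    return preds


def prediction_loss(preds: Sequence[Window], targets: Sequence[Window]) -> float:
    """Mean squared error between predicted and actual future windows (assumed loss)."""
    total, count = 0.0, 0
    for p_win, t_win in zip(preds, targets):
        for p, t in zip(p_win, t_win):
            for a, b in zip(p, t):
                total += (a - b) ** 2
                count += 1
    return total / count


# Toy sequence in which every feature channel grows by 1 per frame.
frames = [[float(t), float(t) + 10.0] for t in range(6)]
window = frames[0:3]                      # features of frames 0..2
preds = rollout(window, linear_step, 2)   # predicted windows 1..3 and 2..4
targets = [frames[1:4], frames[2:5]]      # actual windows one and two steps ahead
loss = prediction_loss(preds, targets)
```

Because the toy sequence is exactly linear, the extrapolating stand-in reproduces the future windows and the loss is zero. With real video features, a learned predictor would have to capture motion to drive this loss down, which is the intuition behind training on future prediction.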

📝 Abstract
In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by making it learn to predict the future during training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to predict multi-frame features one time step ahead iteratively. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries to work with the depth decoder, as well as a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has similar latencies when compared to monocular models.
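The masked auto-encoding objective described for R-Net can be sketched in a few lines. The abstract does not say how the mask is adapted, so the criterion below (masking the positions whose features change most across frames) is purely an assumption for illustration, as are the names `change_scores`, `adaptive_mask`, `apply_mask`, and `masked_reconstruction_loss`, and the choice of mean squared error.

```python
from typing import List, Sequence

Volume = List[List[float]]  # toy multi-frame feature volume: [frame][position]


def change_scores(volume: Sequence[Sequence[float]]) -> List[float]:
    """Score each position by its summed absolute change between consecutive frames."""
    n_pos = len(volume[0])
    return [
        sum(abs(volume[t + 1][i] - volume[t][i]) for t in range(len(volume) - 1))
        for i in range(n_pos)
    ]


def adaptive_mask(volume: Sequence[Sequence[float]], ratio: float) -> List[bool]:
    """Mask the highest-scoring positions (assumed criterion, not from the paper)."""
    scores = change_scores(volume)
    k = max(1, round(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    masked = set(ranked[:k])
    return [i in masked for i in range(len(scores))]


def apply_mask(volume: Sequence[Sequence[float]], mask: Sequence[bool],
               fill: float = 0.0) -> Volume:
    """Replace masked positions with a fill value in every frame."""
    return [[fill if m else v for v, m in zip(frame, mask)] for frame in volume]


def masked_reconstruction_loss(recon: Sequence[Sequence[float]],
                               target: Sequence[Sequence[float]],
                               mask: Sequence[bool]) -> float:
    """Mean squared error computed over masked positions only."""
    errs = [
        (r - t) ** 2
        for r_frame, t_frame in zip(recon, target)
        for r, t, m in zip(r_frame, t_frame, mask)
        if m
    ]
    return sum(errs) / len(errs)


# Three frames, four positions; positions 1 and 2 change over time.
volume = [
    [1.0, 5.0, 2.0, 0.0],
    [1.0, 7.0, 2.5, 0.0],
    [1.0, 9.0, 3.0, 0.0],
]
mask = adaptive_mask(volume, ratio=0.5)
masked_input = apply_mask(volume, mask)
# Loss if the masked input were returned unchanged (no reconstruction at all).
baseline_loss = masked_reconstruction_loss(masked_input, volume, mask)
# Loss for a perfect reconstruction.
perfect_loss = masked_reconstruction_loss(volume, volume, mask)
```

In this sketch a reconstruction network would receive `masked_input` and be trained to lower the loss from `baseline_loss` toward `perfect_loss`. Restricting the loss to masked positions forces the model to infer hidden content from the visible positions in neighboring frames, which is one plausible reading of how such an objective encourages multi-frame correspondence learning.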
Problem

Research questions and friction points this paper is trying to address.

Depth Estimation
Video Analysis
Multi-frame Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

FutureDepth
F-Net
R-Net