STATIC: Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Monocular video depth estimation suffers from inter-frame depth inconsistency, and existing approaches rely on motion priors—such as optical flow or camera parameters—resulting in high memory overhead and poor robustness to dynamic or irregular motion. This paper proposes a temporal consistency modeling framework that requires no additional motion priors. First, a static-dynamic region mask is generated based on surface normal discrepancies. Then, decoupled modeling is performed: a Masked Static module enhances temporal consistency in static regions, while a Surface Normal Similarity module aligns features in dynamic regions. Finally, multi-scale feature fusion and joint refinement are applied. Evaluated on the KITTI and NYUv2 video depth benchmarks, our method achieves state-of-the-art performance, significantly improving temporal consistency while reducing memory consumption and dependency on motion assumptions.
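The summary describes generating a static-dynamic region mask from surface normal discrepancies across frames. As a rough illustration (not the paper's implementation; the thresholding scheme and the `threshold` value here are assumptions), the idea of scoring per-pixel directional variance of normals over time might look like:

```python
import numpy as np

def static_dynamic_mask(normals, threshold=0.1):
    """Hypothetical sketch: split pixels into static vs. dynamic regions
    by the temporal variance of surface-normal direction.

    normals: (T, H, W, 3) unit surface normals for T consecutive frames.
    Returns a boolean (H, W) mask that is True where the region is static.
    """
    # Mean normal direction over time, re-normalised to unit length.
    mean_n = normals.mean(axis=0)
    mean_n = mean_n / (np.linalg.norm(mean_n, axis=-1, keepdims=True) + 1e-8)

    # Directional variance: 1 minus the average cosine similarity
    # between each frame's normal and the temporal mean direction.
    cos_sim = (normals * mean_n[None]).sum(axis=-1)   # (T, H, W)
    variance = 1.0 - cos_sim.mean(axis=0)             # (H, W)

    # Low directional variance -> surface orientation is stable -> static.
    return variance < threshold
```

Pixels whose normals barely change direction over the clip fall below the variance threshold and are treated as static; the rest are routed to the dynamic-region branch.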

📝 Abstract
Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic areas without additional information. A difference mask derived from surface normals identifies static and dynamic areas by measuring directional variance. For static areas, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic areas, the Surface Normal Similarity (SNS) module aligns regions and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic areas, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.
Problem

Research questions and friction points this paper is trying to address.

Improving temporal depth consistency in video monocular depth estimation
Addressing limitations of traditional multi-frame and optical flow methods
Separately handling static and dynamic areas without external information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Separately learns temporal consistency for static and dynamic areas
Uses surface normal directional variance to identify static and dynamic regions
Integrates static and dynamic refinements for full-sequence consistency
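For the dynamic branch, the paper's SNS module aligns regions by measuring feature similarity between frames rather than relying on an optical-flow prior. A minimal sketch of the underlying operation, per-pixel cosine similarity between consecutive feature maps (the function name and interface are illustrative assumptions, not the paper's API), could be:

```python
import numpy as np

def frame_feature_similarity(feat_prev, feat_curr):
    """Hypothetical sketch of an SNS-style step: score how well each
    pixel's feature in the current frame matches the previous frame,
    so dynamic regions can be aligned by feature similarity instead
    of an explicit motion prior such as optical flow.

    feat_prev, feat_curr: (H, W, C) feature maps from consecutive frames.
    Returns an (H, W) cosine-similarity map in [-1, 1].
    """
    def unit(x):
        # Normalise channel vectors to unit length for cosine similarity.
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    return (unit(feat_prev) * unit(feat_curr)).sum(axis=-1)
```

High-similarity locations indicate corresponding content across frames, which a refinement stage could then use to enforce consistent depth in moving regions.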