🤖 AI Summary
To address high latency in binocular depth estimation for AR glasses—caused by conventional image rectification and cost-volume computation—this paper proposes HomoDepth, an end-to-end learnable lightweight architecture. Methodologically, it introduces (1) a novel single-stage homography matrix prediction network guided by a rectification positional encoding (RPE), eliminating explicit geometric rectification; (2) the MultiHeadDepth module, which approximates cosine similarity via grouped pointwise convolutions and layer normalization, enabling cost-volume-free stereo matching; and (3) a multi-task disparity-robust training strategy to enhance generalization on unrectified or misaligned stereo pairs. Experiments show that MultiHeadDepth achieves 11.8–30.3% higher accuracy and 22.9–25.2% lower latency than an industrial state-of-the-art method; HomoDepth reduces end-to-end latency by 44.5%; and the disparity-robust training further decreases AbsRel error by 10.0–24.3%.
📝 Abstract
Stereo depth estimation is a fundamental component in augmented reality (AR), which requires low latency for real-time processing. However, preprocessing such as rectification and non-ML computations such as the cost volume incur significant latency, exceeding that of the ML model itself, which hinders the real-time processing required by AR. Therefore, we develop alternative approaches to rectification and the cost volume that account for ML acceleration (GPUs and NPUs) in recent hardware. We eliminate the pre-processing by introducing a homography matrix prediction network with a rectification positional encoding (RPE), which delivers both low latency and robustness to unrectified images. We replace the cost volume with a group-pointwise convolution-based operator and an approximation of cosine similarity based on layer normalization and dot products. Based on these approaches, we develop the MultiHeadDepth (replacing the cost volume) and HomoDepth (MultiHeadDepth + removed pre-processing) models. MultiHeadDepth provides 11.8-30.3% improvements in accuracy and 22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation model for AR glasses from industry. HomoDepth, which can directly process unrectified images, reduces the end-to-end latency by 44.5%. We also introduce a multi-task learning method to handle misaligned stereo inputs on HomoDepth, which reduces the AbsRel error by 10.0-24.3%. The overall results demonstrate the efficacy of our approaches, which not only reduce the inference latency but also improve the model performance. Our code is available at https://github.com/UCI-ISA-Lab/MultiHeadDepth-HomoDepth
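The layer-norm-plus-dot-product idea mentioned in the abstract can be illustrated with a minimal NumPy sketch (this is an assumption about the underlying math, not the authors' implementation; function names are illustrative): the dot product of two layer-normalized vectors, divided by the feature dimension, equals the cosine similarity of the mean-centered vectors, so normalization-by-magnitude can be replaced by a cheap, hardware-friendly layer-norm pass.

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # Normalize to zero mean and unit variance along the last (feature) axis,
    # as a standard (affine-free) layer normalization would.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def approx_cosine(a, b):
    # Dot product of layer-normalized features, scaled by 1/d.
    # Since ||x - mean(x)|| = sqrt(d) * std(x), this equals the exact
    # cosine similarity of the mean-centered vectors (up to eps).
    d = a.shape[-1]
    return (layernorm(a) * layernorm(b)).sum(axis=-1) / d

def exact_cosine_centered(a, b):
    # Reference: cosine similarity after mean-centering each vector.
    ca = a - a.mean(axis=-1, keepdims=True)
    cb = b - b.mean(axis=-1, keepdims=True)
    na = np.linalg.norm(ca, axis=-1)
    nb = np.linalg.norm(cb, axis=-1)
    return (ca * cb).sum(axis=-1) / (na * nb)
```

The practical payoff sketched here is that per-pixel magnitude normalization (a division and square root per feature vector) is folded into a layer-norm operator that NPUs/GPUs already accelerate well.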