🤖 AI Summary
Perspective-dependent blur induced by camera rotation during handheld capture exhibits depth-varying blur kernels, posing a fundamental challenge for monocular depth estimation.
Method: We propose a blur-pattern–aware monocular depth estimation framework that jointly models the spatial distribution of motion blur and camera trajectory in video sequences. Leveraging sliding-window embedding and multi-window aggregation, we employ vision-language models to densely interpolate sparse point trajectories obtained via point tracking, thereby enhancing the fidelity of depth–blur mapping.
Contribution/Results: To our knowledge, this is the first work to integrate the depth-dependent nature of perspective blur with vision-language priors, enabling metric-scale depth prediction and high-fidelity trajectory reconstruction without stabilization hardware. Extensive evaluation on multiple standard depth benchmarks demonstrates significant improvements over state-of-the-art unsupervised and self-supervised methods—achieving broader depth range coverage, superior generalization, and up to 32% reduction in trajectory reconstruction error.
📝 Abstract
In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capturing, which induces perspective-based blur especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimate metric depth by examining the blur pattern of a video stream and dense trajectory via joint optical design algorithm. Specifically, we employ off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.