Pixel-Perfect Visual Geometry Estimation

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision-based geometric foundation models, which often suffer from flying-pixel artifacts and loss of fine details in monocular and video depth estimation, yielding low-quality 3D point clouds. To overcome these issues, the authors propose Pixel-Perfect Depth (PPD), a monocular depth foundation model built on pixel-space diffusion transformers (DiT), and its video extension, PPVD. PPD pairs a Semantics-Prompted DiT, which injects semantic representations from vision foundation models into the diffusion process, with a Cascade DiT architecture that progressively increases the number of image tokens, enhancing fine-grained geometric detail while preserving global semantic consistency and efficiency. For video, PPVD adds a Semantics-Consistent DiT that extracts temporally consistent semantics from a multi-view geometry foundation model and performs reference-guided token propagation within the DiT for efficient, temporally coherent modeling. Experiments show state-of-the-art performance among generative monocular and video depth estimation models, with significantly cleaner and more detailed 3D point clouds than existing approaches.
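
Neither the summary nor the abstract specifies how the semantic prompt enters the network, so the following is a minimal, hypothetical PyTorch sketch: it assumes per-token semantic features from a frozen vision foundation model are linearly projected and added to the DiT's tokens before self-attention. SemanticsPromptedBlock, sem_proj, and all tensor shapes are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn as nn

class SemanticsPromptedBlock(nn.Module):
    """One DiT block whose image tokens are conditioned on external semantic features.
    Hypothetical sketch; the paper's real injection mechanism is not published here."""

    def __init__(self, dim: int, sem_dim: int, num_heads: int = 8):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)  # map foundation-model features into token space
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens: torch.Tensor, sem_feats: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, N, dim)     noisy pixel-space tokens at the current diffusion step
        # sem_feats: (B, N, sem_dim) per-token semantics from a frozen vision encoder
        tokens = tokens + self.sem_proj(sem_feats)  # "prompt" the block with global semantics
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        tokens = tokens + attn_out
        return tokens + self.mlp(self.norm2(tokens))
```

A full PPD-style model would stack such blocks and feed the same semantics to each; the additive residual above is just the simplest variant, and the actual model could equally inject the prompt via cross-attention or adaptive normalization.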

📝 Abstract
Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
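
To make the two efficiency mechanisms above concrete, here is a speculative PyTorch sketch of (a) a cascade step that grows the token sequence between DiT stages and (b) reference-guided token propagation via cross-attention to a cached reference frame. grow_tokens, ReferenceTokenPropagation, and all shapes are hypothetical; the paper's actual operators are not described on this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grow_tokens(tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """Cascade step (assumed): double the token grid, (B, grid*grid, C) -> (B, (2*grid)**2, C)."""
    b, n, c = tokens.shape
    assert n == grid * grid, "tokens must form a square grid"
    x = tokens.transpose(1, 2).reshape(b, c, grid, grid)  # token sequence -> 2D feature map
    x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    return x.flatten(2).transpose(1, 2)                   # back to a 4x-longer token sequence

class ReferenceTokenPropagation(nn.Module):
    """Cross-attention from current-frame tokens to cached reference-frame tokens (assumed form)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cur: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # cur: (B, N, dim) tokens of the frame currently being denoised
        # ref: (B, M, dim) tokens cached from an already-processed reference frame
        q, kv = self.norm_q(cur), self.norm_kv(ref)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return cur + out  # residual update keeps the current frame dominant
```

In a PPVD-style pipeline, the reference tokens would presumably be computed once and reused across frames, which is what would keep the added computational and memory overhead small.
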
Problem

Research questions and friction points this paper is trying to address.

flying pixels
loss of fine details
visual geometry estimation
point cloud quality
monocular depth estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

pixel-perfect geometry
diffusion transformer
semantics-prompted generation
cascade DiT
temporal coherence