Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

📅 2025-10-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing generative monocular depth estimation methods rely on VAEs to compress depth maps into a latent space, causing edge distortion and flying pixels. To address this, the authors propose Pixel-Perfect Depth, which models the depth distribution directly in pixel space, bypassing VAE-induced distortions. The model introduces Semantics-Prompted Diffusion Transformers (SP-DiT), which leverage semantic representations from vision foundation models as conditional guidance for the diffusion process, together with a cascaded DiT design that progressively grows the number of tokens, jointly ensuring global semantic coherence and local geometric fidelity. Evaluated on five mainstream benchmarks, the model achieves the best performance among published generative models and significantly outperforms prior approaches on edge-aware point cloud evaluation, producing high-fidelity, flying-pixel-free point clouds from estimated depth maps.

📝 Abstract
This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces *flying pixels* at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
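The abstract describes SP-DiT as injecting semantic representations from a vision foundation model into DiT to prompt the diffusion process. A minimal NumPy sketch of one plausible conditioning mechanism is shown below: noisy depth tokens cross-attend to projected semantic features and receive a residual update. The projection and attention weights here are hypothetical illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantics_prompted_attention(depth_tokens, semantic_feats,
                                 w_proj, w_q, w_k, w_v):
    """Cross-attend noisy depth tokens to projected semantic prompts.

    depth_tokens:   (N, d) tokens from the noisy depth map
    semantic_feats: (M, e) features from a frozen vision foundation model
    w_proj:         (e, d) projection into the DiT token dimension
    w_q, w_k, w_v:  (d, d) attention weights (hypothetical)
    """
    prompts = semantic_feats @ w_proj                # (M, d) semantic prompts
    q = depth_tokens @ w_q                           # (N, d) queries
    k = prompts @ w_k                                # (M, d) keys
    v = prompts @ w_v                                # (M, d) values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M) attention map
    return depth_tokens + attn @ v                   # residual update, (N, d)
```

The residual form keeps the depth tokens' shape unchanged, so a block like this could in principle be dropped into each DiT layer; whether the paper conditions per-layer or only at the input is not stated in this summary.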
Problem

Research questions and friction points this paper is trying to address.

Eliminating VAE-induced flying pixels in depth estimation
Overcoming high complexity of pixel-space diffusion generation
Enhancing semantic consistency and fine-grained details in depth maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-space diffusion generation eliminates VAE artifacts
Semantics-Prompted Diffusion Transformers enhance semantic consistency
Cascade DiT Design progressively increases tokens for efficiency
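The Cascade DiT Design above is described only as progressively increasing the number of tokens across stages. A toy NumPy sketch of that coarse-to-fine schedule, assuming a square token grid grown by nearest-neighbor repetition between stages (the growth factor and upsampling rule are assumptions, not details from the paper):

```python
import numpy as np

def grow_tokens(tokens, grid_hw):
    """Double the token grid resolution by nearest-neighbor repetition.

    tokens:  (h*w, d) flattened token grid
    grid_hw: (h, w)   current grid shape
    """
    h, w = grid_hw
    d = tokens.shape[-1]
    grid = tokens.reshape(h, w, d)
    grid = grid.repeat(2, axis=0).repeat(2, axis=1)  # (2h, 2w, d)
    return grid.reshape(-1, d), (2 * h, 2 * w)

def cascade_schedule(tokens, grid_hw, stage_fns):
    """Run DiT stages coarse-to-fine, growing tokens between stages.

    stage_fns: list of shape-preserving per-stage denoising functions
               (placeholders here for actual DiT stages).
    """
    for i, stage in enumerate(stage_fns):
        tokens = stage(tokens)
        if i < len(stage_fns) - 1:                   # grow before next stage
            tokens, grid_hw = grow_tokens(tokens, grid_hw)
    return tokens, grid_hw
```

Early stages thus attend over far fewer tokens (attention cost scales quadratically in token count), which is one plausible reading of the efficiency claim; later stages refine at full resolution for geometric fidelity.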