🤖 AI Summary
Diffusion models face a fundamental contradiction in dense prediction tasks: their inherent stochastic noise is incompatible with the deterministic geometric mapping required for accurate spatial reasoning, leading to structural degradation and spatial detail distortion. This work is the first to formally identify and address this conflict, proposing a noise-free deterministic inference framework. It decomposes pre-trained diffusion models into an ensemble of time-step-specific visual experts, then integrates their implicit geometric priors via denoising reconstruction, time-varying expert ensembling, and self-supervised prior aggregation. A lightweight task-specific fine-tuning stage follows to enable single-step precise prediction. The method achieves state-of-the-art or competitive performance on depth estimation and surface normal prediction, while requiring ≤50% of the training data needed by conventional approaches—significantly improving both efficiency and accuracy.
📝 Abstract
Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction, which requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^{3}$-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^{3}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and aggregates their heterogeneous priors, in a self-supervised manner, into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^{3}$-Predictor achieves competitive or state-of-the-art performance across diverse scenarios. In addition, it requires less than half the training data used by previous methods and performs inference efficiently in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.
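To make the "ensemble of timestep-dependent visual experts" idea concrete, here is a minimal, hypothetical sketch of noise-free expert aggregation. The paper's actual network, timestep selection, and aggregation weights are not specified here; `denoiser`, the chosen timesteps, and the uniform weights are all illustrative stand-ins.

```python
import numpy as np

# Hypothetical sketch: treat a pretrained denoiser f(x, t) as a set of
# timestep-dependent "experts". Feeding the *clean* input (no added noise)
# at several timesteps yields heterogeneous priors, which are aggregated
# into a single deterministic prediction. All names are illustrative.

rng = np.random.default_rng(0)

def denoiser(x, t):
    """Stand-in for a pretrained diffusion network's prediction head.
    Here: a fixed, timestep-dependent elementwise map, for illustration only."""
    w = np.sin(t + np.arange(x.size)).reshape(x.shape)  # deterministic per-t weights
    return x * w

def aggregate_experts(x, timesteps, weights=None):
    """Noise-free ensembling: run the clean input through each timestep
    expert and combine the outputs with (possibly learned) weights."""
    preds = np.stack([denoiser(x, t) for t in timesteps])
    if weights is None:
        weights = np.full(len(timesteps), 1.0 / len(timesteps))  # uniform fallback
    return np.tensordot(weights, preds, axes=1)

x = rng.standard_normal((4, 4))          # a clean "image"
prior = aggregate_experts(x, timesteps=[10, 200, 500, 900])
assert prior.shape == x.shape            # one clean, aggregated geometric prior
```

Because no noise is injected, running the same input twice gives identical outputs, which is the deterministic behavior the abstract argues dense prediction requires; a lightweight task-specific head would then map this aggregated prior to depth or normals in a single step.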