🤖 AI Summary
This work investigates how visual priors from image editing models, rather than text-to-image generative models, can enhance dense geometric estimation. To this end, we propose FE2E, the first framework to adapt a Diffusion Transformer (DiT)-based editing model for joint monocular depth and surface normal estimation. Our method reformulates the editor's flow matching loss into a "consistent velocity" training objective, applies logarithmic quantization to reconcile the editor's native BFloat16 format with the task's high precision demands, and leverages the DiT's global attention to predict depth and normals jointly in a single forward pass at no extra cost. Without scaling up the training data, FE2E achieves over 35% improvement in zero-shot depth estimation on ETH3D over prior state-of-the-art methods and outperforms DepthAnything variants trained on 100× more data. These results empirically support editing models as stronger sources of geometric priors than conventional diffusion-based generators.
📝 Abstract
Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning.
Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts.
Based on these findings, we introduce FE2E, a framework that pioneers the adaptation of an advanced editing model built on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor to this deterministic task, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demands of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other.
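To make the two training-time ideas above more concrete, here is a minimal PyTorch sketch of one plausible instantiation. This is a hedged illustration, not the released FE2E code: the straight-line interpolation path, the model signature `model(x_t, t)`, and the helper names `log_quantize` and `consistent_velocity_loss` are assumptions introduced purely for exposition.

```python
import torch
import torch.nn.functional as F

def log_quantize(depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Hypothetical helper: encode depth in log space before casting to
    # BFloat16, so that near-range values retain relative precision that
    # a linear encoding would lose to bf16's short mantissa.
    return torch.log(depth.clamp_min(eps))

def consistent_velocity_loss(model, image_latent: torch.Tensor,
                             target_latent: torch.Tensor) -> torch.Tensor:
    # One plausible reading of the "consistent velocity" objective:
    # interpolate between the conditioning (image) latent and the geometry
    # target latent, and supervise the predicted velocity with the constant
    # straight-line velocity (target - source) at every sampled timestep t,
    # reflecting the deterministic nature of the task.
    t = torch.rand(image_latent.shape[0], device=image_latent.device)
    t_ = t.view(-1, 1, 1, 1)                      # assumes 4D latents (B,C,H,W)
    x_t = (1.0 - t_) * image_latent + t_ * target_latent
    v_target = target_latent - image_latent       # constant along the path
    v_pred = model(x_t, t)                        # assumed model interface
    return F.mse_loss(v_pred, v_target)
```

Under this reading, the supervision signal is identical at every timestep, which is one way a flow-matching editor could be specialized for a deterministic image-to-image mapping rather than stochastic generation.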
Without scaling up the training data, FE2E achieves substantial performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data. The project page is available at https://amap-ml.github.io/FE2E/.