Dens3R: A Foundation Model for 3D Geometry Prediction

📅 2025-07-22
🤖 AI Summary
Existing dense 3D reconstruction methods typically predict individual geometric quantities—such as depth, surface normals, or point clouds—in isolation, leading to cross-attribute inconsistency and limiting accuracy and generalization. To address this, we propose the first unified multi-geometric joint regression framework that explicitly encodes geometric constraints among depth, normals, and point maps via structural coupling. We further introduce position-interpolated rotary positional encoding to ensure cross-view consistency during inference and robustness to high-resolution inputs. Our method employs a lightweight shared encoder-decoder architecture, a two-stage training strategy, image-pair matching-based feature fusion, and geometrically consistent multi-view post-processing. Evaluated under both single- and multi-view settings, our approach achieves significant improvements over state-of-the-art methods across multiple dense 3D tasks—including depth estimation, normal prediction, and point cloud reconstruction—demonstrating strong generalization, robustness to high-resolution inputs, and broad applicability to downstream vision tasks.
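The structural coupling among depth, normals, and point maps that the summary describes can be illustrated with one well-known geometric relation: a surface normal is (up to sign) the unit cross product of a pointmap's spatial derivatives. The sketch below is an illustrative reconstruction of that general constraint, not the paper's actual network or loss; the function name and finite-difference scheme are assumptions.

```python
import numpy as np

def normals_from_pointmap(P):
    # P: (H, W, 3) per-pixel 3D points (a "pointmap").
    # The normal at each pixel is approximated as the unit cross product
    # of the pointmap's horizontal and vertical derivatives, computed
    # here with central finite differences (np.gradient).
    dPdu = np.gradient(P, axis=1)  # (H, W, 3), derivative along image x
    dPdv = np.gradient(P, axis=0)  # (H, W, 3), derivative along image y
    n = np.cross(dPdu, dPdv)       # unnormalized normal field
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-8, None)
```

A joint-regression framework can exploit this relation as a consistency constraint: predicted normals should agree with normals derived from the predicted pointmap, rather than being estimated in isolation.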

📝 Abstract
Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.
Problem

Research questions and friction points this paper is trying to address.

Unified prediction of correlated 3D geometric quantities
Ensuring consistency in joint geometric property estimation
Generalizable 3D foundation model for diverse downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for joint geometric regression
Lightweight shared encoder-decoder backbone design
Position-interpolated rotary positional encoding
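The position-interpolated rotary positional encoding listed above combines two known ingredients: rotary position embedding (RoPE), which rotates channel pairs by position-dependent angles, and position interpolation, which rescales test-time positions into the range seen during training so that high-resolution inputs stay in-distribution. The sketch below illustrates that general combination under stated assumptions; the function names and the `train_len` parameter are illustrative, not the paper's exact formulation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # One frequency per channel pair, as in standard RoPE.
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (seq, dim/2)

def apply_rope(x, positions):
    # x: (seq, dim) token features with dim even.
    # Rotate each (even, odd) channel pair by its position-dependent angle.
    ang = rope_angles(positions, x.shape[1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def interpolated_positions(seq_len, train_len):
    # Position interpolation: linearly rescale positions so a longer
    # test-time sequence maps into the [0, train_len) range seen in
    # training, instead of extrapolating to unseen positions.
    scale = min(1.0, train_len / seq_len)
    return np.arange(seq_len) * scale
```

Because each channel pair is rotated rather than scaled, the encoding preserves token feature norms, and the interpolation step lets the same trained weights process higher-resolution (longer) token sequences.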
👥 Authors
Xianze Fang — Alibaba Group, China
Jingnan Gao — Shanghai Jiao Tong University, China (Ph.D. student; Computer Vision)
Zhe Wang — Alibaba Group, China
Zhuo Chen — Shanghai Jiao Tong University, China
Xingyu Ren — Shanghai Jiao Tong University, China (Ph.D. graduate; Face Modeling, Generative AI)
Jiangjing Lv — Alibaba Group, China
Qiaomu Ren — Alibaba Group, China
Zhonglei Yang — Alibaba Group, China
Xiaokang Yang — Shanghai Jiao Tong University, China
Yichao Yan — Shanghai Jiao Tong University, China
Chengfei Lv — Alibaba Group, China