🤖 AI Summary
This work addresses the challenge of achieving metric-scale multi-view 3D reconstruction for robotic visual navigation under unknown or partially known camera parameters. We propose UniScale, a unified feed-forward network that, for the first time, jointly estimates camera intrinsics and extrinsics, scale-invariant depth, and point clouds within a single model while recovering the scene's true metric scale. Built on a semantically guided modular architecture, UniScale integrates global contextual reasoning with camera-aware feature representations and flexibly incorporates geometric priors, such as known intrinsics or poses, to improve accuracy without retraining. The model also reuses the world-scale priors embedded in its pretrained weights. Experiments on multiple benchmarks demonstrate strong generalization and robustness, with UniScale consistently delivering high-precision metric reconstructions.
📝 Abstract
We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, accurately extracting environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of the scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale recovers the true metric scale of the scene. In robotic settings where camera intrinsics are known, they can be readily incorporated to improve performance, with further gains when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch: it leverages the world priors embedded in pre-existing models that lack explicit geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.