🤖 AI Summary
This work addresses a limitation of feed-forward 3D reconstruction methods such as VGGT, which predict pointmaps in camera-centric coordinates. Camera-centric frames make it hard to exploit structural priors of the scene and leave all three rotational degrees of freedom unconstrained between views, which can lead to inconsistent reconstructions. The authors instead propose predicting pointmaps in an upright, gravity-aligned coordinate system, where a shared vertical axis reduces inter-view rotational ambiguity. They introduce the Gravity Grounded Geometry Transformer (G3T) and the G3T-Long incremental reconstruction framework, which integrate gravity-aligned coordinates into pointmap prediction for the first time by combining a Transformer architecture, gravity-aware pose estimation, and a submap stitching strategy. Experiments show that this approach significantly improves reconstruction accuracy and robustness over existing methods in incremental 3D reconstruction, supporting the effectiveness of gravity-aligned representations.
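To make the idea of a camera-to-gravity pose concrete, here is a minimal sketch; it is not the G3T implementation. It assumes a z-up upright frame and a gravity direction already expressed in camera coordinates (in G3T this would be predicted by the model; here it is simply an input). The function names `rotation_aligning` and `camera_to_upright` are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotation_aligning(a, b):
    """Return the minimal rotation R (3x3) such that R @ a is parallel to b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    c = float(np.dot(a, b))  # cosine of the angle between a and b
    if np.isclose(c, -1.0):
        # Antiparallel vectors: rotate by 180 degrees about any axis
        # perpendicular to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-8:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    v = np.cross(a, b)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # Rodrigues' formula written in terms of the cross product v and cosine c.
    return np.eye(3) + K + (K @ K) / (1.0 + c)

def camera_to_upright(points_cam, gravity_cam):
    """Rotate a camera-centric pointmap (..., 3) into an assumed z-up frame.

    gravity_cam is the direction of gravity (pointing down) in camera
    coordinates; after rotation it maps to (0, 0, -1).
    """
    R = rotation_aligning(gravity_cam, [0.0, 0.0, -1.0])
    return points_cam @ R.T, R
```

This rotation fixes only the vertical axis: any additional rotation about that axis (yaw) still yields a valid upright frame, which is why a single remaining angle relates two upright views.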
📝 Abstract
Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
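The reduction in rotational degrees of freedom can be illustrated with a small sketch. When two pointmaps are both expressed in upright frames that share a vertical axis, the rotation relating them is a single yaw angle, so rigid alignment has 4 degrees of freedom (yaw plus translation) instead of 6. The code below is an illustrative closed-form least-squares solver for that case, assuming a z-up convention and known point correspondences; it is not the alignment procedure used by G3T-Long, and the names `yaw_rotation` and `align_upright` are hypothetical.

```python
import numpy as np

def yaw_rotation(theta):
    """Rotation by angle theta (radians) about the vertical z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def align_upright(src, dst):
    """Least-squares yaw + translation mapping src (N, 3) onto dst (N, 3).

    Assumes both point sets are in gravity-aligned (z-up) frames and that
    row i of src corresponds to row i of dst.
    """
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    p = src - src_mean
    q = dst - dst_mean
    # Only the horizontal (x, y) components constrain the yaw angle.
    cross = np.sum(p[:, 0] * q[:, 1] - p[:, 1] * q[:, 0])
    dot = np.sum(p[:, 0] * q[:, 0] + p[:, 1] * q[:, 1])
    theta = np.arctan2(cross, dot)
    R = yaw_rotation(theta)
    t = dst_mean - R @ src_mean
    return theta, R, t
```

Calling `align_upright(src, dst)` returns the yaw angle, the corresponding rotation matrix, and the translation; the aligned points are then `src @ R.T + t`. Because only one angle is estimated, the solution reduces to a single `arctan2` rather than the SVD needed for a full 3D rotation.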