🤖 AI Summary
Traditional visual geometric reconstruction anchors its output to a fixed reference view, leading to instability or outright failure when that reference is poorly chosen. To address this, we propose the first fully permutation-equivariant framework for visual geometric reconstruction, eliminating reference-view dependency entirely. Our method is a feed-forward network with rigorously enforced permutation equivariance that jointly predicts affine-invariant camera poses and scale-invariant local point maps, guaranteeing symmetric, input-order-agnostic handling of arbitrary input image permutations. We validate our approach on three core tasks: monocular/video depth estimation, camera pose estimation, and dense point map reconstruction. It achieves state-of-the-art performance across all benchmarks, with substantial gains in generalization, robustness to input ordering and occlusion, and scalability to varying numbers of input views.
📝 Abstract
We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.
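The permutation equivariance central to the abstract means that reordering the input views reorders the per-view outputs identically, with no view acting as a privileged reference. A minimal sketch of the property (a toy Deep-Sets-style layer in NumPy, not the paper's actual architecture) is:

```python
import numpy as np

# Toy illustration only — not the $\pi^3$ model. A layer of the form
#   f(x)_i = W1 @ x_i + W2 @ mean_j(x_j)
# (per-view transform plus a pooled, order-invariant context term)
# is permutation-equivariant: permuting the input views permutes
# the outputs the same way, so no view is a privileged reference.
rng = np.random.default_rng(0)
n_views, dim = 5, 4
W1 = rng.standard_normal((dim, dim))
W2 = rng.standard_normal((dim, dim))

def equivariant_layer(x):
    # x: (n_views, dim), one feature vector per input view.
    # The mean over views is unchanged by any permutation of rows.
    return x @ W1.T + x.mean(axis=0) @ W2.T  # broadcasts over views

x = rng.standard_normal((n_views, dim))
perm = rng.permutation(n_views)

# Permuting then applying the layer equals applying then permuting.
assert np.allclose(equivariant_layer(x[perm]), equivariant_layer(x)[perm])
```

Attention layers without positional encodings across the view axis have the same property, which is one standard way such equivariance is enforced in practice.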