🤖 AI Summary
This work addresses the challenge of effectively integrating 3D priors—such as camera pose, intrinsic parameters, and depth—into pre-trained image models to enhance multi-view reconstruction performance without modifying or retraining the underlying network. The authors propose a Test-time Constrained Optimization (TCO) framework that incorporates these priors as soft constraints during inference. By jointly optimizing the output of a multi-view Transformer using self-supervised photometric and geometric consistency losses along with prior-based regularization terms, TCO achieves substantial improvements in reconstruction accuracy. Evaluated on benchmarks including ETH3D, 7-Scenes, and NRGBD, the method reduces the point-map distance error by more than half compared to image-only baselines and outperforms prior-aware feed-forward approaches that require retraining, all while preserving the original model architecture.
📝 Abstract
We introduce a test-time framework for multi-view Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented as a photometric or geometric loss between each view and renderings generated from the other views. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.
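The loss structure described above (a self-supervised cross-view consistency term plus soft penalties pulling predictions toward available priors) can be sketched in a toy form. This is a minimal, hypothetical illustration, not the paper's implementation: real TCO optimizes a multi-view Transformer's outputs with photometric/geometric rendering losses, whereas here we simply refine two views' depth vectors (stand-ins for network predictions) by gradient descent on a quadratic surrogate. All function names and values are illustrative assumptions.

```python
import numpy as np

def tco_loss(d1, d2, d_prior, lam=1.0):
    """Surrogate TCO objective: cross-view consistency + soft prior penalty.

    d1, d2   : depth predictions for the same points from two views (illustrative)
    d_prior  : an available depth prior on view 1
    lam      : weight of the prior penalty (soft constraint, not a hard one)
    """
    consistency = np.sum((d1 - d2) ** 2)        # self-supervised compatibility term
    prior_pen = lam * np.sum((d1 - d_prior) ** 2)  # penalty toward the prior
    return float(consistency + prior_pen)

def tco_refine(d1, d2, d_prior, lam=1.0, lr=0.1, steps=200):
    """Test-time gradient descent on the surrogate objective (no retraining:
    only the predictions themselves are updated at inference)."""
    d1, d2 = d1.copy(), d2.copy()
    for _ in range(steps):
        g1 = 2 * (d1 - d2) + 2 * lam * (d1 - d_prior)  # dL/dd1
        g2 = -2 * (d1 - d2)                            # dL/dd2
        d1 -= lr * g1
        d2 -= lr * g2
    return d1, d2

# Inconsistent initial predictions and a prior for view 1 (made-up numbers).
d1_0 = np.array([1.0, 2.0, 3.0])
d2_0 = np.array([1.5, 1.5, 3.5])
prior = np.array([1.2, 1.8, 3.1])

d1, d2 = tco_refine(d1_0, d2_0, prior)
assert tco_loss(d1, d2, prior) < tco_loss(d1_0, d2_0, prior)
```

The design point the sketch captures is that priors enter only through penalty terms on output modalities, so the same procedure applies whether or not a given prior is available (set `lam=0` to drop it), and the pre-trained network itself is never modified.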