AI Summary
To address the high computational cost of pairwise matching and the ill-posedness of pose graph optimization in multi-view point cloud global registration, this paper proposes the first end-to-end feed-forward framework for joint pose prediction. The method predicts the global poses of all scans in a single forward pass, without pairwise estimation or iterative pose graph optimization. Key contributions include: (1) a Registration Transformer that embeds multi-view point clouds into a unified latent space; (2) incorporation of attention priors from 2D foundation models to enhance 3D geometric consistency via cross-modal 2D-to-3D attention transfer; and (3) an SE(3)^N joint Lie-group diffusion fine-tuning framework, supervised by a prior-conditioned variational lower bound, that refines the predicted poses by denoising. The architecture integrates sparse 3D CNN-based superpoint encoding, geometric alternating attention, and cross-modal attention transfer. Evaluated on 3DMatch, ScanNet, and ARKitScenes, the method is reported to achieve state-of-the-art accuracy together with high computational efficiency.
Abstract
Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework that corrects FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet, and ARKitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.
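The alternation between intra-scan and inter-scan reasoning mentioned in the abstract can be illustrated with a small sketch. This is not the authors' Geometric Alternating Attention module: it assumes the simplest possible reading, in which each block first applies self-attention among the superpoint tokens of one scan and then applies self-attention across the tokens of all scans. The function names (`self_attention`, `alternating_attention`) and the `num_blocks` parameter are hypothetical, and the attention here has no learned projections, geometric encodings, or 2D attention priors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def self_attention(tokens):
    """Unparameterised scaled dot-product self-attention.

    tokens: list of D-dimensional feature vectors (lists of floats).
    Each output token is a convex combination of the input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * k[j] for wi, k in zip(w, tokens)) for j in range(d)])
    return out


def alternating_attention(scans, num_blocks=2):
    """Alternate intra-scan and inter-scan attention (illustrative assumption).

    scans: list of N scans; each scan is a list of superpoint tokens.
    Returns features with the same nested shape as the input.
    """
    for _ in range(num_blocks):
        # Intra-scan step: tokens attend only to tokens of the same scan.
        scans = [self_attention(tokens) for tokens in scans]

        # Inter-scan step: flatten all scans and attend across every token.
        sizes = [len(tokens) for tokens in scans]
        flat = self_attention([tok for tokens in scans for tok in tokens])

        # Split the flat token list back into per-scan lists.
        scans, start = [], 0
        for size in sizes:
            scans.append(flat[start:start + size])
            start += size
    return scans


if __name__ == "__main__":
    # Two toy scans with 2 and 3 superpoints, each with 2-D features.
    toy = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]]
    fused = alternating_attention(toy)
    print([len(s) for s in fused])
```

Keeping the token count per scan small (low-resolution superpoints) is what makes the inter-scan step affordable, since its cost grows quadratically with the total number of tokens across all scans. A practical implementation would use a tensor library with learned query, key, and value projections rather than plain Python lists.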