🤖 AI Summary
Efficient localization and high-fidelity rendering in large-scale scenes are hampered by excessive computational cost and the limited capacity of single-network architectures. To address this, we propose a Mixture-of-Experts (MoE)-accelerated scene coordinate regression framework: a lightweight gating network dynamically selects a single expert per inference pass, ensuring sparse activation, while a loss-free load-balancing strategy guarantees equitable expert utilization without auxiliary objectives. Our method achieves significant improvements in both localization accuracy and training efficiency while maintaining low computational overhead. Experiments on the Cambridge Landmarks dataset demonstrate that our approach attains high-precision pose estimation and photorealistic rendering within just 10 minutes of training, substantially reducing computational cost compared to state-of-the-art methods. Specifically, it lowers median pose error by 23.6% and improves rendering PSNR by 1.8 dB.
📝 Abstract
Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MoE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furthermore, we present an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to enhance localization accuracy on large-scale scenes. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
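The two ingredients described above, top-1 gating over sub-networks and auxiliary-loss-free load balancing, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the number of experts, feature dimension, bias-update step `GAMMA`, and the sign-based bias update rule are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 4    # hypothetical number of SCR sub-networks
DIM = 8          # hypothetical gating feature dimension
GAMMA = 0.01     # bias update step size (assumption)

W_gate = rng.normal(size=(DIM, N_EXPERTS))  # lightweight gating network
bias = np.zeros(N_EXPERTS)                  # routing-only bias, not trained by any loss

def route(x):
    """Top-1 routing: only the argmax expert is activated per inference."""
    logits = x @ W_gate
    return int(np.argmax(logits + bias))    # bias steers routing decisions only

def update_bias(counts):
    """Loss-free load balancing: nudge biases toward uniform expert usage.

    Overloaded experts get their bias lowered, underloaded experts raised,
    so routing evens out without an auxiliary balancing loss term."""
    global bias
    bias -= GAMMA * np.sign(counts - counts.mean())

# simulate routing a batch of features and rebalancing afterwards
counts = np.zeros(N_EXPERTS)
for _ in range(256):
    x = rng.normal(size=DIM)
    counts[route(x)] += 1
update_bias(counts)
```

Because the bias enters only the routing decision and not the expert's output weighting, it shifts which expert handles each input without perturbing the gradients of the main localization objective, which is the point of avoiding an auxiliary balancing loss.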