ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

📅 2025-10-13
🤖 AI Summary
Scene coordinate regression (SCR) methods generalize poorly in visual relocalization because conventional frameworks encode the training views in the weights of the coordinate regressor itself, so they fail when query images differ too much from the training views in lighting or viewpoint. ACE-G decouples the map representation from coordinate regression: a generic, scene-agnostic transformer acts as the regressor, and each scene is represented by its own map code. This separation allows the transformer to be pre-trained on tens of thousands of scenes and, during pre-training, to be trained to generalize from mapping images to unseen query images. On multiple challenging relocalization datasets, ACE-G significantly increases robustness while keeping the computational footprint attractive.

📝 Abstract
Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.
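The separation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the class `SharedRegressor`, the helper `random_map_code`, the single cross-attention layer, the dimensions, and the random weights standing in for pre-trained ones. The point is only that one set of scene-agnostic weights can serve several scenes, with the scene itself supplied as a separate map code.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SharedRegressor:
    """Scene-agnostic weights (hypothetical single cross-attention layer)."""

    def __init__(self, feat_dim, code_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_query = init(code_dim, feat_dim)  # image feature -> attention query
        self.w_out = init(3, code_dim)           # attended code -> (x, y, z)

    def predict(self, patch_feature, map_code):
        """Return a 3D scene coordinate for one image patch feature."""
        query = matvec(self.w_query, patch_feature)
        scale = math.sqrt(len(query))
        # Attend from the image feature to the tokens of the scene's map code.
        scores = [sum(q * t for q, t in zip(query, token)) / scale for token in map_code]
        weights = softmax(scores)
        attended = [
            sum(w * token[i] for w, token in zip(weights, map_code))
            for i in range(len(query))
        ]
        return matvec(self.w_out, attended)


def random_map_code(num_tokens, code_dim, seed):
    """Stand-in for a learned scene-specific map code (random, not trained)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(code_dim)] for _ in range(num_tokens)]


regressor = SharedRegressor(feat_dim=4, code_dim=8)
code_scene_a = random_map_code(num_tokens=16, code_dim=8, seed=1)
code_scene_b = random_map_code(num_tokens=16, code_dim=8, seed=2)

patch_feature = [0.2, -0.1, 0.7, 0.3]
coord_a = regressor.predict(patch_feature, code_scene_a)
coord_b = regressor.predict(patch_feature, code_scene_b)
```

The same `regressor` object and the same `patch_feature` yield different coordinates for the two scenes, because all scene-specific information enters through the map code rather than through the regressor's weights.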

Problem

Research questions and friction points this paper is trying to address.

Improving generalization of scene coordinate regression methods
Overcoming overfitting to training views in visual relocalization
Enhancing robustness across varying lighting and viewpoints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Separates the coordinate regressor (a generic transformer) from the map representation (a scene-specific map code)
Pre-trains the transformer on tens of thousands of scenes
Trains the transformer during pre-training to generalize from mapping images to unseen query images
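A toy illustration of the workflow these points suggest: shared weights stay fixed once pre-trained, only the scene code is fitted on mapping views, and the result is checked on a query outside the range of those views. This is a sketch under strong assumptions. The scalar linear model, the names `predict` and `fit_scene_code`, the constant `SHARED_WEIGHT`, and the idea that mapping a new scene optimizes only the code are all hypothetical stand-ins, not ACE-G's transformer or its actual training procedure.

```python
def predict(shared_weight, scene_code, x):
    """Toy regressor: the shared part scales the input, the scene part offsets it."""
    return shared_weight * x + scene_code


def fit_scene_code(shared_weight, mapping_samples, steps=200, lr=0.1):
    """Fit only the scene code by gradient descent; the shared weight is frozen."""
    scene_code = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to scene_code.
        grad = sum(
            2.0 * (predict(shared_weight, scene_code, x) - y)
            for x, y in mapping_samples
        ) / len(mapping_samples)
        scene_code -= lr * grad
    return scene_code


# Assumption: pretend this value came out of pre-training on many scenes.
SHARED_WEIGHT = 2.0

# Mapping views of a new scene whose true relation is y = 2x + 5.
mapping_samples = [(x, 2.0 * x + 5.0) for x in (0.0, 1.0, 2.0)]
scene_code = fit_scene_code(SHARED_WEIGHT, mapping_samples)

# Query far outside the range covered by the mapping views.
query_x = 10.0
query_prediction = predict(SHARED_WEIGHT, scene_code, query_x)
```

In this toy the query prediction is close to the true value of 25 even though no mapping sample lies near `query_x`, because the behaviour away from the mapping views comes from the shared weight and not from the fitted code.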