🤖 AI Summary
Existing attribution methods for image geolocation models struggle to reveal whether predictions rely on human-interpretable, object-level visual cues. This work proposes an object-centric analysis pipeline that first extracts salient regions from attribution maps such as Grad-CAM, then decomposes them into object-like elements using image segmentation. The predictive relevance of these elements is evaluated through crop-based deletion and insertion tests, with randomly selected regions of similar coverage as the baseline. Experiments on a three-country benchmark show that attribution-guided crops consistently retain more predictive information than random crops, suggesting that attribution maps can be decomposed into localized, interpretable elements and offering a step toward object-level analysis of geolocation models.
📝 Abstract
When humans play geolocation games such as GeoGuessr, they rely on concrete visual cues, such as road markings, vegetation, or architectural details, to infer where an image was captured. Whether image geolocation models rely on similar object-level evidence remains hard to determine: attribution methods like Grad-CAM typically highlight diffuse regions rather than coherent visual entities, which makes it difficult to link model predictions to specific objects or perceptible patterns. In this work, we propose an object-centric analysis pipeline to investigate the visual evidence used by geolocation models. Starting from attribution maps, we extract salient regions and segment them into object-like elements. We evaluate their predictive relevance through deletion and insertion tests, comparing attribution-guided crops to randomly selected regions with similar coverage. Experiments on a three-country benchmark show that attribution-guided crops consistently retain more information for the model's prediction than random crops. These results suggest that attribution maps can be decomposed into interpretable, perceptible elements, providing a step toward object-level analysis of geolocation models.
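To make the deletion and insertion comparison concrete, the following is a minimal sketch of how an attribution-guided crop and a random crop of equal coverage might be evaluated. It is an illustration under stated assumptions, not the authors' implementation: the function names (`salient_box`, `random_box`, `keep_only`, `remove`), the relative threshold of 0.5, the zero fill value, and the toy scoring function standing in for a geolocation model are all hypothetical, and the segmentation of salient regions into object-like elements is omitted.

```python
import numpy as np

def salient_box(attr, rel_threshold=0.5):
    """Bounding box (top, left, bottom, right) of pixels whose attribution
    is at least rel_threshold * attr.max(). The threshold is an assumption."""
    mask = attr >= rel_threshold * attr.max()
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1

def random_box(shape, height, width, rng):
    """Random box with the same height and width, i.e. the same coverage."""
    top = int(rng.integers(0, shape[0] - height + 1))
    left = int(rng.integers(0, shape[1] - width + 1))
    return top, left, top + height, left + width

def keep_only(image, box, fill=0.0):
    """Insertion-style test: keep the crop, blank out everything else."""
    t, l, b, r = box
    out = np.full_like(image, fill)
    out[t:b, l:r] = image[t:b, l:r]
    return out

def remove(image, box, fill=0.0):
    """Deletion-style test: blank out the crop, keep everything else."""
    t, l, b, r = box
    out = image.copy()
    out[t:b, l:r] = fill
    return out

# Toy stand-in for a model's confidence: it only "looks at" one patch.
def toy_score(image):
    return float(image[8:16, 8:16].mean())

image = np.ones((32, 32), dtype=np.float64)
attr = np.zeros((32, 32), dtype=np.float64)
attr[8:16, 8:16] = 1.0  # toy attribution map concentrated on that patch

guided = salient_box(attr)
rng = np.random.default_rng(0)
baseline = random_box(image.shape, guided[2] - guided[0], guided[3] - guided[1], rng)

guided_insertion = toy_score(keep_only(image, guided))
guided_deletion = toy_score(remove(image, guided))
random_insertion = toy_score(keep_only(image, baseline))
```

In this toy setting the guided crop preserves the full score when kept and removes it entirely when deleted, while the random crop of equal area can retain at most as much. With a real model, `toy_score` would be replaced by the model's confidence for the predicted location, and the comparison would be averaged over many images and random crops.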