🤖 AI Summary
This work proposes a training-free multimodal pipeline that addresses the limitations of existing tour guide applications, which often rely on predefined content or proprietary data and struggle to deliver fine-grained, context-aware narration for user-captured landmark images. By integrating visual features from user photographs with open geospatial databases such as OpenStreetMap, the method leverages vision-language models, GPS coordinates, geometric matching, and geospatial queries to accurately identify and annotate landmarks. It further employs large language models to generate rich textual and spoken commentary. This approach represents the first integration of open geographic data with vision-language models to enable scalable, context-aware interpretation of both well-known and obscure landmarks without requiring model retraining, thereby significantly enhancing the immersion and interactivity of mobile tour experiences.
📝 Abstract
We present AutoTour, a system that enhances user exploration by automatically generating fine-grained landmark annotations and descriptive narratives for photos captured by users. The key idea of AutoTour is to fuse visual features extracted from photos with nearby geospatial features queried from open mapping databases. Unlike existing tour applications that rely on predefined content or proprietary datasets, AutoTour leverages open and extensible data sources to provide scalable, context-aware photo-based guidance. To achieve this, we design a training-free pipeline that first extracts and filters relevant geospatial features around the user's GPS location. It then detects major landmarks in user photos through vision-language model (VLM)-based feature detection and projects them onto the horizontal ground plane. A geometric matching algorithm aligns photo features with their corresponding geospatial entities based on estimated distance and direction. The matched features are then grounded and annotated directly on the original photo, accompanied by textual and audio descriptions generated by a large language model to provide an informative, tour-like experience. We demonstrate that AutoTour delivers rich, interpretable annotations for both iconic and lesser-known landmarks, enabling a new form of interactive, context-aware exploration that bridges visual perception and geospatial understanding.
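To make the matching step concrete, the sketch below shows one plausible form of distance-and-direction alignment: each landmark detected in the photo (with an estimated azimuth and range from the camera) is greedily paired with the queried geospatial entity that minimizes a weighted bearing-plus-distance cost relative to the user's GPS position. This is an illustrative assumption, not the paper's actual algorithm; the function names, the cost weights, and the greedy pairing strategy are all hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS coordinates."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees (0 = north, clockwise) from point 1 to point 2."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360

def match_landmarks(photo_feats, geo_feats, user_lat, user_lon,
                    w_dir=1.0, w_dist=0.05):
    """Greedily pair each detected photo feature, given as
    {name: (estimated_azimuth_deg, estimated_distance_m)}, with the
    geospatial entity {id: (lat, lon)} minimizing a combined cost."""
    matches, used = {}, set()
    for name, (az, dist) in photo_feats.items():
        best, best_cost = None, float("inf")
        for gid, (glat, glon) in geo_feats.items():
            if gid in used:
                continue
            g_az = bearing_deg(user_lat, user_lon, glat, glon)
            g_dist = haversine_m(user_lat, user_lon, glat, glon)
            d_az = min(abs(az - g_az), 360 - abs(az - g_az))  # handle wrap-around
            cost = w_dir * d_az + w_dist * abs(dist - g_dist)
            if cost < best_cost:
                best, best_cost = gid, cost
        if best is not None:
            matches[name] = best
            used.add(best)
    return matches
```

A one-to-one greedy assignment is the simplest choice here; a real system might instead solve a global assignment (e.g. Hungarian matching) and apply a cost threshold to reject spurious detections.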