🤖 AI Summary
Indoor multimedia geolocation in digital forensics remains challenging due to structural layout similarity, variable illumination, GPS signal absence, and scarcity of training data. Method: This paper introduces a semantic anchoring approach that uses electrical power sockets, whose types are standardized by country or region, as geolocation cues. We propose a three-stage deep learning pipeline: YOLOv11 for socket detection, Xception for fine-grained classification of 12 socket types, and rule-based mapping combined with confidence-thresholded inference to determine the country of origin. Contribution/Results: We construct two dedicated forensic-oriented datasets, one for socket detection and one for socket classification. Evaluated on the TraffickCam subset of Hotels-50K, which consists of crowd-sourced hotel images taken under real-world conditions, our method achieves a detection mAP@0.5 of 0.843, a classification accuracy of 91.2%, and 96% country inference accuracy at a confidence threshold above 90%. All code, trained models, and datasets are publicly released.
📝 Abstract
Computer vision is a rapidly evolving field that is giving rise to powerful new tools and techniques for digital forensic investigation and shows great promise for novel forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at a confidence threshold above 90%). To address data scarcity, two dedicated datasets were created: a socket detection dataset of 2,328 annotated images, expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This subset provides a more realistic evaluation than the professional, well-lit, often wide-angle images found on travel websites. The framework is a practical step toward real-world digital forensic applications. The code, trained models, and data for this paper are released as open source.
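The third stage described above, mapping a classified socket type to candidate countries only when the classifier is confident enough, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SOCKET_TO_COUNTRIES`, `SocketPrediction`, and `infer_countries` are hypothetical, the lookup table is a small illustrative subset of IEC plug-type letters rather than the paper's 12 classes (which the abstract does not list), and taking the single highest-confidence prediction is an assumed decision rule.

```python
from dataclasses import dataclass

# Illustrative subset only: a few IEC plug-type letters and some countries
# that use them. This is NOT the paper's 12-class mapping.
SOCKET_TO_COUNTRIES = {
    "G": {"United Kingdom", "Ireland", "Singapore", "Malaysia"},
    "I": {"Australia", "New Zealand", "China", "Argentina"},
    "J": {"Switzerland", "Liechtenstein"},
    "L": {"Italy", "Chile"},
}


@dataclass(frozen=True)
class SocketPrediction:
    """Hypothetical output of the classification stage for one detected socket."""
    socket_type: str   # predicted plug-type label, e.g. "G"
    confidence: float  # classifier confidence in [0, 1]


def infer_countries(predictions, threshold=0.90):
    """Return a sorted list of candidate countries for an image.

    Assumed rule: keep the highest-confidence prediction and accept it only
    if its confidence is strictly above `threshold`; otherwise abstain by
    returning an empty list.
    """
    if not predictions:
        return []
    best = max(predictions, key=lambda p: p.confidence)
    if best.confidence <= threshold:
        return []  # abstain rather than guess from a weak prediction
    # Unknown labels map to no countries instead of raising an error.
    return sorted(SOCKET_TO_COUNTRIES.get(best.socket_type, set()))


# Example: two sockets found in one image, only one classified confidently.
example = [SocketPrediction("G", 0.97), SocketPrediction("L", 0.55)]
print(infer_countries(example))
```

Abstaining below the threshold trades coverage for precision, which matches the abstract's framing of country accuracy as a figure reported at a confidence threshold above 90%. Because one socket type is often shared by several countries, the sketch returns a candidate list rather than a single country.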