🤖 AI Summary
This paper addresses the challenge of predicting time and location from images in a zero-shot setting. It introduces PuzzleGPT, an interpretable expert pipeline that emulates human multi-clue, puzzle-style reasoning. The authors formalize human puzzle-solving ability into core skills and implement each as a separate module: a perceiver that identifies visual clues, a reasoner that deduces prediction candidates, a combiner that merges information across clues, a web retriever that fetches external knowledge when the task cannot be solved locally, and a noise filter for robustness. The result is a zero-shot, interpretable, and robust approach. Evaluated on the TARA and WikiTilo datasets, it achieves state-of-the-art performance: it outperforms large VLMs, including BLIP-2, InstructBLIP, LLaVA, and GPT-4V, by at least 32%, and automatically generated reasoning pipelines such as VisProg by at least 38%. It also rivals or surpasses finetuned models.
📝 Abstract
The task of predicting time and location from images is challenging and requires complex, human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combine information from different clues combinatorially, a web retriever to get external knowledge if the task cannot be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets: TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.
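To make the five-module structure concrete, the following is a minimal illustrative sketch of how such a pipeline could be wired together. It is not the authors' implementation: the function names, the `Candidate` type, the confidence threshold used as a stand-in for the noise filter, the clue-group size limit, and the confidence-weighted vote are all assumptions made for illustration, and the perceiver, reasoner, and retriever are toy stubs where a real system would call vision and language models or a search API.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    """A hypothetical prediction candidate with a confidence score."""
    answer: str
    confidence: float


# Hypothetical stage signatures (assumptions, not from the paper).
Perceiver = Callable[[str], List[str]]                      # image id -> textual clues
Reasoner = Callable[[Tuple[str, ...]], Optional[Candidate]]  # clue group -> candidate
Retriever = Callable[[str], Optional[Candidate]]             # clue -> candidate via external lookup


def solve(image: str, perceive: Perceiver, reason: Reasoner, retrieve: Retriever,
          threshold: float = 0.5, max_group: int = 2) -> Optional[str]:
    # Perceiver: extract visual clues from the image.
    clues = perceive(image)

    # Combiner + reasoner: reason over single clues, then over clue groups.
    candidates: List[Candidate] = []
    for size in range(1, max_group + 1):
        for group in combinations(clues, size):
            cand = reason(group)
            if cand is not None:
                candidates.append(cand)

    # Noise filter (assumed here to be a simple confidence threshold).
    kept = [c for c in candidates if c.confidence >= threshold]

    # Web retriever: consulted only when nothing survives locally.
    if not kept:
        for clue in clues:
            cand = retrieve(clue)
            if cand is not None and cand.confidence >= threshold:
                kept.append(cand)

    if not kept:
        return None

    # Aggregate by confidence-weighted vote (an assumed aggregation rule).
    votes: Counter = Counter()
    for c in kept:
        votes[c.answer] += c.confidence
    return votes.most_common(1)[0][0]


# Toy stubs standing in for real models.
def toy_perceive(image: str) -> List[str]:
    return ["Eiffel Tower", "snow", "vintage car"]


def toy_reason(group: Tuple[str, ...]) -> Optional[Candidate]:
    if "Eiffel Tower" in group:
        return Candidate("Paris", 0.9)
    if group == ("snow",):
        return Candidate("Oslo", 0.2)  # weak guess, dropped by the filter
    return None


def toy_retrieve(clue: str) -> Optional[Candidate]:
    return None


result = solve("example.jpg", toy_perceive, toy_reason, toy_retrieve)
print(result)
```

In this toy run the landmark clue dominates and the low-confidence guess from the weather clue is filtered out. The retriever is never called because the local candidates already pass the filter, which mirrors the abstract's description of external knowledge being fetched only when the task cannot be solved locally.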