PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses zero-shot prediction of the time and location at which an image was taken. It proposes PuzzleGPT, an interpretable expert pipeline that emulates human multi-cue, puzzle-style reasoning. The authors formalize this puzzle-solving ability into core skills and implement each as a module: a perceiver that identifies visual clues, a reasoner that deduces prediction candidates, a combiner that merges information across clues, a web retriever that fetches external knowledge when the task cannot be solved locally, and a noise filter for robustness. Evaluated on the TARA and WikiTilo datasets, it records state-of-the-art performance, outperforming large VLMs (BLIP-2, InstructBLIP, LLaVA, and GPT-4V) by at least 32% and automatically generated reasoning pipelines such as VisProg by at least 38%. It also rivals or surpasses finetuned models.

📝 Abstract
The task of predicting time and location from images is challenging and requires complex human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combinatorially combine information from different clues, a web retriever to get external knowledge if the task can't be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets -- TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.
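
The five modules named in the abstract can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`solve`, `filter_noise`), the confidence threshold, the clue-group size limit, and the toy stubs standing in for the vision and language models are all hypothetical and chosen only to illustrate how a perceiver, noise filter, combiner, reasoner, and retriever might fit together.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type aliases; none of these names come from the paper.
Clues = Dict[str, str]        # clue name -> textual description
Guess = Tuple[str, float]     # (answer, confidence in [0, 1])
Perceiver = Callable[[object], Clues]
Reasoner = Callable[[List[str]], Optional[Guess]]
Retriever = Callable[[List[str]], List[str]]


def filter_noise(clues: Clues) -> Clues:
    """Drop clues whose description is blank (a stand-in for a noise filter)."""
    return {name: text for name, text in clues.items() if text.strip()}


def solve(image: object, perceive: Perceiver, reason: Reasoner,
          retrieve: Retriever, threshold: float = 0.6,
          max_group: int = 2) -> Optional[Guess]:
    """Return the most confident guess, or None if no clue yields one."""
    clues = filter_noise(perceive(image))
    best: Optional[Guess] = None
    # Combiner: try single clues first, then progressively larger groups.
    for size in range(1, max_group + 1):
        for group in combinations(sorted(clues), size):
            texts = [clues[name] for name in group]
            guess = reason(texts)
            if guess is None or guess[1] < threshold:
                # Not solvable locally: retry with retrieved knowledge added.
                guess = reason(texts + retrieve(texts))
            if guess is not None and (best is None or guess[1] > best[1]):
                best = guess
        if best is not None and best[1] >= threshold:
            return best  # confident enough; skip larger clue groups
    return best


# Toy stubs (assumptions) standing in for real models and web search.
def toy_perceive(image: object) -> Clues:
    return {"signage": "banner reading 'Tokyo 2020'", "blur": "   "}


def toy_reason(texts: List[str]) -> Optional[Guess]:
    joined = " ".join(texts)
    if "Olympics were held in 2021" in joined:
        return ("Tokyo, 2021", 0.9)
    if "Tokyo" in joined:
        return ("Tokyo, 2020", 0.4)
    return None


def toy_retrieve(texts: List[str]) -> List[str]:
    return ["The Tokyo 2020 Olympics were held in 2021"]


result = solve("photo.jpg", toy_perceive, toy_reason, toy_retrieve)
print(result)
```

In this toy run the blank clue is filtered out, the single remaining clue gives a low-confidence guess, and the retriever is consulted only because that guess falls below the threshold. Passing the modules in as callables keeps each stage replaceable, which is one plausible way to preserve the interpretability the abstract emphasizes.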
Problem

Research questions and friction points this paper is trying to address.

Image Prediction
Time and Location Prediction
Zero-Shot Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Reasoning
PuzzleGPT
Accuracy Improvement