PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

📅 2025-01-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses zero-shot prediction of the time and location at which an image was taken. It proposes PuzzleGPT, an interpretable expert pipeline that emulates human multi-cue, puzzle-style reasoning. The authors formalize this puzzle-solving ability into core skills and implement each as a module: a perceiver that identifies visual clues, a reasoner that deduces prediction candidates, a combiner that combines information across clues, a web retriever that fetches external knowledge when the task cannot be solved locally, and a noise filter for robustness. Evaluated on the TARA and WikiTilo benchmarks, PuzzleGPT records state-of-the-art results: it outperforms large vision-language models (BLIP-2, InstructBLIP, LLaVA, and GPT-4V) by at least 32% and automatically generated reasoning pipelines such as VisProg by at least 38%, and it rivals or surpasses finetuned models.

📝 Abstract

The task of predicting time and location from images is challenging and requires complex human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combine information from different clues combinatorially, a web retriever to get external knowledge if the task cannot be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets, TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.
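
The control flow the abstract describes (perceive clues, filter noise, combine clues, reason locally, fall back to the web) can be illustrated with a small sketch. This is not the authors' implementation: every name here (`Clue`, `solve`, the `toy_*` stubs, `max_group`) is hypothetical, the modules are hard-coded stand-ins rather than real models, and applying the noise filter directly after perception is an assumption, since the abstract does not say where in the pipeline it runs.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Clue:
    kind: str   # hypothetical clue type, e.g. "text" or "landmark"
    value: str  # what the perceiver reported


Group = Tuple[Clue, ...]


def solve(
    image_id: str,
    perceive: Callable[[str], List[Clue]],
    keep: Callable[[Clue], bool],
    reason: Callable[[Group], Optional[str]],
    retrieve: Callable[[Group], Optional[str]],
    max_group: int = 2,
) -> Tuple[Optional[str], Group, str]:
    """Return (prediction, clues used, source of the answer)."""
    # Perceiver, then noise filter (placement of the filter is assumed).
    clues = [c for c in perceive(image_id) if keep(c)]
    # Combiner: enumerate clue groups, smallest groups first.
    groups = [g for n in range(1, max_group + 1) for g in combinations(clues, n)]
    # Try local reasoning on every group before falling back to retrieval.
    for source, module in (("local", reason), ("web", retrieve)):
        for group in groups:
            answer = module(group)
            if answer is not None:
                return answer, group, source
    return None, (), "unsolved"


# Toy stand-ins for the real modules (all hard-coded for illustration).
def toy_perceive(image_id: str) -> List[Clue]:
    return [
        Clue("object", "blurry smudge"),
        Clue("text", "EXPO"),
        Clue("landmark", "Atomium"),
    ]


def toy_keep(clue: Clue) -> bool:
    return "blurry" not in clue.value


def toy_reason(group: Group) -> Optional[str]:
    # Only the two clues together are enough for this toy reasoner.
    values = {c.value for c in group}
    return "Brussels, 1958" if values == {"EXPO", "Atomium"} else None


def toy_retrieve(group: Group) -> Optional[str]:
    # Pretend a web lookup resolves any group that contains the landmark.
    return "Brussels, 1958" if any(c.value == "Atomium" for c in group) else None


answer, used, source = solve("img-001", toy_perceive, toy_keep, toy_reason, toy_retrieve)
print(answer, [c.value for c in used], source)
```

Returning the clue group alongside the prediction is one simple way to keep the output inspectable, in the spirit of the interpretability the abstract claims; the actual system may expose its reasoning differently.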

Problem

Research questions and friction points this paper is trying to address.

Image Prediction

Time and Location Prediction

Zero-Shot Accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Reasoning

PuzzleGPT

Accuracy Improvement

🔎 Similar Papers

Are LLMs Good Cryptic Crossword Solvers?