🤖 AI Summary
This work investigates whether mainstream large language models (LLMs), including LLaMA2, Mistral, and ChatGPT, can solve cryptic crosswords, a demanding linguistic reasoning task that requires punning, morphological decomposition, and the interpretation of cultural metaphors. To this end, the authors introduce the first human-annotated benchmark designed specifically for cryptic crossword solving and evaluate the models uniformly under zero-shot and few-shot prompting. All evaluated models achieve accuracy below 15%, falling far short of human experts and exposing fundamental limitations in multi-step lexical manipulation, non-literal semantic parsing, and cultural-context modeling. The contribution is twofold: (1) critical bottlenecks in deep language understanding are identified for current LLMs; and (2) the first standardized benchmark for cryptic reasoning is released, providing a common yardstick for assessing and advancing these capabilities in language models.
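The zero-/few-shot evaluation loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual harness: the prompt wording, the scoring rule, and the example clues (two well-known cryptic clues with their standard answers) are assumptions.

```python
def build_prompt(clue, examples=()):
    """Build a zero-shot (no examples) or few-shot prompt for one cryptic clue.

    `examples` is a sequence of (clue, answer) pairs shown before the target
    clue; an empty sequence yields the zero-shot variant.
    """
    lines = ["Solve this cryptic crossword clue. Reply with the answer only."]
    for ex_clue, ex_answer in examples:
        lines.append(f"Clue: {ex_clue}\nAnswer: {ex_answer}")
    lines.append(f"Clue: {clue}\nAnswer:")
    return "\n\n".join(lines)


def exact_match_accuracy(predictions, answers):
    """Case- and whitespace-insensitive exact match over model outputs."""
    norm = lambda s: "".join(s.split()).upper()
    hits = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return hits / len(answers)


# Few-shot usage with a classic clue as the in-context example
# ("Gegs" is an anagram indicator-free rebus for SCRAMBLED EGGS):
prompt = build_prompt(
    "Flower of London? (6)",  # "flower" = thing that flows -> THAMES
    examples=[("Gegs (9,4)", "SCRAMBLED EGGS")],
)
```

In a full run, `prompt` would be sent to each model (LLaMA2, Mistral, ChatGPT) and the generations scored with `exact_match_accuracy` against the gold answers.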
📝 Abstract
Cryptic crosswords are puzzles that rely not only on general knowledge but also on the solver's ability to manipulate language at several levels and to handle various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models. However, the abilities of large language models (LLMs) have not yet been tested on this task. In this paper, we establish benchmark results for three popular LLMs -- LLaMA2, Mistral, and ChatGPT -- showing that their performance on this task remains far below that of humans.