CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM/LVLM evaluation frameworks lack modeling of the dynamic interaction between textual semantics and visual structural constraints. This paper introduces CrossWordBench, the first cross-modal reasoning benchmark grounded in crossword puzzles. It uses controllable generation to fuse textual clues with grid-based crossing constraints, unifying the assessment of semantic reasoning and cross-modal coordination. Its contributions are threefold: (1) a novel, controllable evaluation paradigm that jointly enforces textual semantic and visual structural constraints; (2) support for multi-format outputs and interactive evaluation, exposing models' deficiencies in modeling implicit letter-overlap logic; (3) a rule-LLM collaborative puzzle-generation framework with a multi-granularity evaluation protocol (end-to-end solving, interactive prompting, subtask decomposition). Experiments across 20+ models show that reasoning-oriented LLMs significantly outperform non-reasoning baselines, that LVLM performance is highly contingent on grid-parsing accuracy, and that current models exhibit systematic bottlenecks in structured, implicit-constraint reasoning.

📝 Abstract
Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal reasoning in LLMs and LVLMs using crossword puzzles
Assessing dynamic interplay between text and visual constraints in models
Measuring grid-parsing accuracy and puzzle-solving performance in LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

CrossWordBench evaluates LLMs and LVLMs via crossword puzzles
Controllable puzzle generation supports text and image formats
Interactive evaluation modes assess multimodal reasoning capabilities
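The crossing-letter constraint at the core of the benchmark can be illustrated with a minimal consistency check: any two answers that intersect on the grid must agree on the shared letter. The grid representation and function below are a hypothetical sketch for illustration, not the paper's actual implementation:

```python
# Sketch of a crossing-letter consistency check (hypothetical data
# layout, not CrossWordBench's actual code). Each placement is
# (row, col, direction, answer); direction is "across" or "down".
def check_crossings(placements):
    """Return a list of (cell, existing_letter, new_letter) conflicts."""
    grid = {}       # maps (row, col) -> letter already placed there
    conflicts = []
    for row, col, direction, answer in placements:
        for i, letter in enumerate(answer.upper()):
            cell = (row, col + i) if direction == "across" else (row + i, col)
            if cell in grid and grid[cell] != letter:
                conflicts.append((cell, grid[cell], letter))
            grid[cell] = letter
    return conflicts

# "CAT" across and "COW" down both start at (0, 0) and share 'C': no conflict.
print(check_crossings([(0, 0, "across", "cat"), (0, 0, "down", "cow")]))  # []
# "DOG" down at (0, 2) would need 'D' where "CAT" already placed 'T'.
print(check_crossings([(0, 0, "across", "cat"), (0, 2, "down", "dog")]))
```

A solver that exploits these constraints can propagate a confidently answered clue into partial letters for every intersecting clue, which is the behavior the paper credits for the advantage of reasoning LLMs.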