EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language models' capabilities in multimodal reasoning, long-range inference, implicit knowledge integration, and lateral reasoning. Method: We introduce a multimodal reasoning benchmark grounded in authentic competition-style puzzles, comprising 1,184 high-difficulty items, using professional puzzle competitions as a testbed for the cognitive frontier of current models. The benchmark features human-curated problem sets, annotated solutions, and verifiable ground-truth answers that enable automated, fine-grained capability diagnostics. Contribution/Results: Experiments reveal that state-of-the-art models achieve markedly lower accuracy on this benchmark than on other challenging benchmarks (e.g., Humanity's Last Exam), exposing fundamental limitations in discovering implicit associations and performing unstructured, multi-step deductive reasoning. This work establishes a rigorous paradigm for evaluating the cognitive boundaries of foundation models.

📝 Abstract
As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity -- each typically requiring teams of skilled solvers hours to days to complete -- with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity's Last Exam, unveiling models' shortcomings when challenged with problems requiring unstructured and lateral reasoning.
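To make the evaluation protocol implied by the abstract concrete, below is a minimal sketch of an automated, exact-match scoring loop over puzzles with unambiguous, verifiable answers. This is illustrative only: the file name enigmaeval.jsonl, the puzzle fields (prompt, images, answer), the normalization rule, and the query_model stub are assumptions made for the sketch, not the authors' released harness or data format.

```python
# Hypothetical sketch of automated exact-match scoring against verifiable
# ground-truth answers. Field names and file layout are assumptions, not
# EnigmaEval's actual release format.
import json
import re


def normalize(answer: str) -> str:
    """Canonicalize an answer: lowercase and drop everything but letters/digits."""
    return re.sub(r"[^a-z0-9]", "", answer.lower())


def query_model(prompt: str, images: list) -> str:
    """Stub for a multimodal model call; replace with a real API client."""
    return ""


def evaluate(puzzles_path: str) -> float:
    """Score a model by exact match of normalized answers over all puzzles."""
    with open(puzzles_path) as f:
        puzzles = [json.loads(line) for line in f]  # one puzzle per JSON line

    correct = 0
    for p in puzzles:
        prediction = query_model(p["prompt"], p.get("images", []))
        if normalize(prediction) == normalize(p["answer"]):
            correct += 1
    return correct / len(puzzles)


if __name__ == "__main__":
    print(f"accuracy: {evaluate('enigmaeval.jsonl'):.3f}")
```

In practice, a harness along these lines would swap the stub for a real multimodal model client; the actual benchmark may apply stricter or looser answer normalization than the simple rule shown here.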
Problem

Research questions and friction points this paper is trying to address.

How to evaluate advanced multimodal reasoning as language models saturate existing benchmarks
Lack of benchmarks that probe implicit knowledge synthesis and multi-step deductive reasoning
Models' shortcomings in unstructured and lateral reasoning are poorly characterized by current evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal reasoning challenges
Implicit knowledge synthesis
Multi-step deductive reasoning