🤖 AI Summary
This paper proposes the word-association board game Codenames as a benchmark for evaluating the reasoning capabilities of large language models (LLMs), a task that demands linguistic understanding, theory of mind, and epistemic reasoning. The authors evaluate several state-of-the-art models (GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1) in both game roles across a variety of board setups. Key findings are: (1) models exhibit distinct emergent behaviours and role specialisation, with different models excelling at giving clues versus guessing them; and (2) teams mixing different LLMs remain effective, making LLM agents more robust and generalisable to a wider range of teammates than the word-embedding techniques used in prior Codenames agents.
📝 Abstract
In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still struggle with lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles. We also evaluate the performance of different combinations of LLMs when playing cooperatively together, demonstrating that LLM agents are more generalisable to a wider range of teammates than prior techniques.
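The abstract contrasts LLM agents with the earlier embedding-based Codenames bots it cites. As a rough illustration of that baseline (not taken from the paper), an embedding clue-giver can score each candidate clue by its cosine similarity to the team's words minus its similarity to words it must avoid. The tiny hand-made vectors below are hypothetical stand-ins for real pre-trained embeddings such as GloVe or word2vec:

```python
import math

# Hypothetical toy embeddings for illustration only; real agents
# would load large pre-trained vectors with a fixed vocabulary.
EMBEDDINGS = {
    "dog":    [0.9, 0.1, 0.0],
    "cat":    [0.8, 0.2, 0.1],
    "pet":    [0.85, 0.15, 0.05],
    "rocket": [0.0, 0.9, 0.4],
    "engine": [0.1, 0.8, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def give_clue(team_words, avoid_words, vocabulary):
    """Pick the clue whose worst-case similarity to the team's words,
    minus its best-case similarity to forbidden words, is highest."""
    best_clue, best_score = None, float("-inf")
    for clue in vocabulary:
        if clue in team_words or clue in avoid_words:
            continue  # Codenames forbids clues that appear on the board
        link = min(cosine(EMBEDDINGS[clue], EMBEDDINGS[w]) for w in team_words)
        risk = max(cosine(EMBEDDINGS[clue], EMBEDDINGS[w]) for w in avoid_words)
        if link - risk > best_score:
            best_clue, best_score = clue, link - risk
    return best_clue

# "pet" links "dog" and "cat" while staying far from "rocket".
print(give_clue({"dog", "cat"}, {"rocket"}, EMBEDDINGS))  # → pet
```

The vocabulary ceiling is visible here: the agent can only ever propose clues that exist in its embedding table, which is one limitation the paper's LLM-based agents avoid.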