🤖 AI Summary
This work addresses the challenge of evaluating large language models’ (LLMs) capacity for real-time, language-driven logical reasoning and dynamic rule rewriting—exemplified by the puzzle game *Baba is You*, where distinguishing between “use” and “mention” of textual elements induces a meta-logical ambiguity. Method: We formalize the game’s structured state transitions as a benchmark and design three prompting strategies (basic, rule expansion, action expansion); additionally, we apply supervised fine-tuning to open-weight models (e.g., Mistral, OLMo) on game-state data. Contribution/Results: Closed-source models (e.g., GPT-4o) substantially outperform open-weight baselines; fine-tuning improves state-parsing accuracy but fails to enhance valid action generation. This study establishes the first LLM evaluation framework grounded in dynamic rule systems, exposing fundamental limitations in symbolic manipulation, causal rule modeling, and reflective reasoning—thereby offering theoretical insights and empirical grounding for developing AI with evolvable logical capabilities.
📝 Abstract
Large language models (LLMs) are known to perform well on language tasks but struggle with reasoning tasks. This paper explores the ability of LLMs to play the 2D puzzle game Baba is You, in which players manipulate rules by rearranging text blocks that define object properties. Because this rule manipulation relies on both language abilities and reasoning, it is a compelling challenge for LLMs. Six LLMs are evaluated using different prompt types: (1) simple, (2) rule-extended, and (3) action-extended prompts. In addition, two models (Mistral, OLMo) are fine-tuned using textual and structural data from the game. Results show that while larger models (particularly GPT-4o) perform better in reasoning and puzzle solving, smaller unadapted models struggle to recognize game mechanics or apply rule changes. Fine-tuning improves the ability to analyze game levels but does not significantly improve solution formulation. We conclude that even for state-of-the-art and fine-tuned LLMs, reasoning about dynamic rule changes is difficult (specifically, understanding the use-mention distinction). The results provide insights into the applicability of LLMs to complex problem-solving tasks and highlight the suitability of games with dynamically changing rules for testing the reasoning and reflective abilities of LLMs.