Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the capability of large language models (LLMs) to predict player actions and generate executable commands for the Avrae Discord bot in *Dungeons & Dragons* (D&D) gameplay. We conduct controlled experiments on the FIREBALL dataset using two distinct LLMs: DeepSeek-R1-Distill-LLaMA-8B (reasoning-oriented) and LLaMA-3.1-8B-Instruct (instruction-tuned), systematically evaluating how prompt engineering affects output accuracy. Results demonstrate that the instruction-tuned model significantly outperforms the reasoning-oriented one; moreover, fine-grained, single-sentence prompt modifications induce substantial improvements in command generation fidelity—highlighting extreme sensitivity to instruction design. Our key contribution is the first empirical identification of instruction tuning as a critical factor in game-semantic parsing tasks. We further establish that precise instruction modeling—not raw reasoning capacity—is more effective for generating syntactically and semantically correct, structured game commands. This work provides a reproducible, practically viable technical pathway for LLM-driven interactive game systems.

📝 Abstract
This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, show that even single-sentence changes in prompts can greatly affect model output, and indicate that instruct models are sufficient for this task compared to reasoning models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning versus instruct models for DnD action prediction
Assessing prompt engineering impact on command generation accuracy
Testing LLM performance on Avrae bot command formatting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using reasoning and instruct models for command generation
Evaluating models with FIREBALL dataset for DnD actions
Comparing prompt engineering effects on model outputs
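The task the paper evaluates can be sketched as follows: assemble a prompt from game context and a player utterance, then check whether the model's output is a syntactically plausible Avrae command. The prompt wording, the context fields, and the command-verb regex below are illustrative assumptions, not the authors' exact templates or evaluation criteria.

```python
import re

def build_prompt(context: str, utterance: str, extra_instruction: str = "") -> str:
    """Assemble a command-generation prompt. `extra_instruction` models the
    single-sentence prompt modifications the paper reports as highly influential."""
    base = (
        "You are assisting a Dungeons & Dragons game run through the Avrae "
        "Discord bot. Given the game context and the player's utterance, "
        "output only the Avrae command to execute.\n"
    )
    return f"{base}{extra_instruction}\nContext: {context}\nPlayer: {utterance}\nCommand:"

# A small subset of Avrae command verbs, chosen for illustration.
AVRAE_COMMAND_RE = re.compile(r'^!(attack|cast|check|save|i)\b')

def looks_like_avrae_command(output: str) -> bool:
    """Cheap syntactic filter: does the model output start with a known
    Avrae command verb?"""
    return bool(AVRAE_COMMAND_RE.match(output.strip()))
```

For example, `looks_like_avrae_command('!cast fireball -t "Goblin1"')` returns `True`, while a plain narrative sentence fails the check; a real evaluation would also need to verify arguments against the game state.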
Patricia Delafuente
University of Maryland, Baltimore County
Arya Honraopatil
University of Maryland, Baltimore County
Lara J. Martin
Assistant Professor, University of Maryland, Baltimore County
Narrative Generation · Speech Processing · Artificial Intelligence · Computational Creativity · AAC