GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Conventional LLM evaluation relies on static benchmarks, limiting adaptability to vertical domains and failing to characterize fine-grained reasoning capabilities. Method: This paper proposes an adaptive evaluation framework grounded in adversarial game theory, dynamically modeling domain knowledge and tracking multi-step reasoning chain integrity via a “Guess Who”-style interactive protocol. It innovatively integrates gamified design, domain-aware dynamic question generation, differentiable evaluation metrics, and lightweight knowledge graph construction. Contribution/Results: The framework enables plug-and-play evaluation across five domains—including finance and healthcare—with strong discriminative power in domain knowledge coverage and reasoning completeness. Experiments show a 62% improvement in evaluation interpretability and a 5.3× acceleration in scenario adaptation speed over traditional benchmarks; configuring the framework for a new domain requires only 2.1 hours on average.

Technology Category

Application Category

📝 Abstract
The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains-finance, healthcare, manufacturing, information technology, and education-demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of static benchmarks in LLM evaluation
Adapting evaluation to diverse domain-specific knowledge and reasoning
Enhancing interpretability and scalability in LLM assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial game-based interactive evaluation framework
Dynamic domain knowledge modeling integration
Progressive reasoning assessment for fidelity
🔎 Similar Papers
No similar papers found.
Q
Qingchen Yu
MemTensor (Shanghai) Technology Co., Ltd.
Z
Zifan Zheng
University of Sydney
Ding Chen
Ding Chen
Postdoctoral Scholar, University of Texas Southwestern Medical Center
S
Simin Niu
Renmin University of China
B
Bo Tang
MemTensor (Shanghai) Technology Co., Ltd.
Feiyu Xiong
Feiyu Xiong
MemTensor (Shanghai) Technology Co., Ltd.
Machine LearningNLPLLM
Zhiyu Li
Zhiyu Li
Tianjin University
Robust controlattitude control