AI Summary
Existing LLM evaluation methods are largely confined to static, single-domain tasks and therefore fail to comprehensively assess general-purpose reasoning capabilities. To address this, we propose KORGym, the first dynamic, gamified evaluation platform to integrate orthogonal knowledge design, multimodal interaction, and reinforcement learning (RL) paradigms. It comprises more than fifty text- and vision-based reasoning games and supports multi-turn dialogue and RL-driven scenarios. Built on Gymnasium, KORGym combines a multimodal game engine, standardized RL environment interfaces, response parsers, and a normalized evaluation protocol. Extensive experiments on 19 LLMs and 8 VLMs show that KORGym uncovers consistent reasoning behavior within model families, confirms the advantage of closed-source models, and quantifies the marginal effects of modality type, policy selection, and response length on performance. The platform substantially improves cross-model comparability and broadens the dimensions along which reasoning can be evaluated.
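To make the architecture concrete, here is a minimal sketch of what one KORGym-style text game could look like behind Gymnasium's standard `reset`/`step` API. The game (`GuessNumberEnv`), its normalized reward scheme, and the regex-based response parser are illustrative assumptions for this sketch, not KORGym's actual implementation.

```python
import re

import gymnasium as gym
from gymnasium import spaces


class GuessNumberEnv(gym.Env):
    """Hypothetical multi-turn text game: find a hidden number from hints."""

    def __init__(self, low: int = 1, high: int = 100, max_turns: int = 8):
        self.low, self.high, self.max_turns = low, high, max_turns
        # Observations and actions are free-form text, matching an LLM's I/O.
        self.observation_space = spaces.Text(max_length=256)
        self.action_space = spaces.Text(max_length=64)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        self._target = int(self.np_random.integers(self.low, self.high + 1))
        self._turn = 0
        return f"Guess a number between {self.low} and {self.high}.", {}

    def step(self, action: str):
        self._turn += 1
        truncated = self._turn >= self.max_turns
        guess = self._parse(action)
        if guess is None:
            return "Reply with a single integer.", 0.0, False, truncated, {}
        if guess == self._target:
            # Assumed normalized reward in [0, 1]: fewer turns scores higher.
            reward = 1.0 - (self._turn - 1) / self.max_turns
            return "Correct!", reward, True, False, {}
        hint = "higher" if guess < self._target else "lower"
        return f"Wrong, go {hint}.", 0.0, False, truncated, {}

    @staticmethod
    def _parse(text: str):
        # Response parser: take the last integer in the model's free-form reply.
        matches = re.findall(r"-?\d+", text)
        return int(matches[-1]) if matches else None
```

Framing each game this way is what lets a single evaluation harness, or an RL training loop, drive every game through the same interface.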
Abstract
Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
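The interactive, multi-turn assessment described above reduces to a standard agent-environment loop. Below is a hedged sketch of such a loop over an environment like the one sketched earlier; `query_llm` is a stand-in for any chat-completion call, and scoring by mean episode reward is an assumption for illustration, not KORGym's exact protocol.

```python
# Hypothetical multi-turn evaluation loop; `query_llm` stands in for any
# chat-completion API, and mean episode reward is an assumed scoring rule.
def evaluate(env, query_llm, episodes: int = 10) -> float:
    total = 0.0
    for seed in range(episodes):
        obs, _ = env.reset(seed=seed)
        history = [{"role": "user", "content": obs}]
        terminated = truncated = False
        while not (terminated or truncated):
            reply = query_llm(history)  # model's free-form text action
            obs, reward, terminated, truncated, _ = env.step(reply)
            history += [{"role": "assistant", "content": reply},
                        {"role": "user", "content": obs}]
            total += reward
    return total / episodes  # normalized per-episode score
```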