KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

📅 2025-05-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LLM evaluation methods are largely confined to static, single-domain tasks and therefore fail to comprehensively assess general-purpose reasoning. To address this, we propose KORGym, a dynamic, gamified evaluation platform that combines knowledge-orthogonal task design, multimodal interaction, and reinforcement learning (RL) paradigms. It comprises more than fifty text- and vision-based reasoning games and supports multi-turn dialogue and RL-driven scenarios. Built on Gymnasium, KORGym integrates a multimodal game engine, standardized RL environment interfaces, response parsers, and a normalized scoring protocol. Extensive experiments across 19 LLMs and 8 VLMs show that KORGym uncovers consistent reasoning patterns within model families, confirms the advantage of closed-source models, and quantifies the effects of modality, reasoning strategy, RL training, and response length on performance, improving cross-model comparability and broadening the dimensions along which reasoning can be evaluated.
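
The summary does not spell out the environment API, but since the platform is built on Gymnasium, a game plausibly looks like a standard `gymnasium.Env` with text observations and actions. The sketch below is a hypothetical minimal example, not KORGym's actual code: the game, class name, and scoring scheme are our assumptions. The prompt is the observation, the model's reply is the action, and the episode ends with a score normalized to [0, 1].

```python
import gymnasium as gym
from gymnasium import spaces


class NumberGuessEnv(gym.Env):
    """Hypothetical KORGym-style text game: guess a hidden integer from
    higher/lower feedback. Illustrates the Gymnasium reset/step interface
    with text observations and a score normalized to [0, 1]; this is NOT
    the platform's actual API."""

    def __init__(self, low: int = 1, high: int = 100, max_turns: int = 7):
        self.low, self.high, self.max_turns = low, high, max_turns
        self.observation_space = spaces.Text(max_length=256)
        self.action_space = spaces.Text(max_length=16)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        self._target = int(self.np_random.integers(self.low, self.high + 1))
        self._turn = 0
        prompt = f"Guess an integer between {self.low} and {self.high}."
        return prompt, {}

    def step(self, action: str):
        self._turn += 1
        truncated = self._turn >= self.max_turns  # out of turns
        try:
            guess = int(action.strip())  # parse the model's reply into a move
        except ValueError:
            return "Reply with a single integer.", 0.0, False, truncated, {}
        if guess == self._target:
            # normalized score: solving in fewer turns earns a higher reward
            reward = 1.0 - (self._turn - 1) / self.max_turns
            return "Correct!", reward, True, False, {}
        hint = "higher" if guess < self._target else "lower"
        return f"Wrong, go {hint}.", 0.0, False, truncated, {}
```

Keeping each game behind this uniform reset/step contract is what lets one harness drive fifty-plus otherwise unrelated games.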

πŸ“ Abstract
Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the general reasoning capabilities of LLMs comprehensively
Overcoming the domain-specific scope of existing benchmarks
Enabling interactive, multi-turn assessment of LLM reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic game platform for LLM reasoning evaluation
Supports interactive, multi-turn, RL-style assessments (a minimal evaluation loop is sketched below)
Offers more than fifty games in textual or visual formats
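
To make the multi-turn assessment concrete, here is a minimal episode loop against a Gymnasium-style game such as the one sketched above. The `query_model` stub and the `evaluate` helper are illustrative assumptions, not KORGym's actual harness, which also handles response parsing, batching, and its normalized evaluation protocol.

```python
def query_model(prompt: str, history: list[str]) -> str:
    """Stand-in for an LLM API call; returns the model's next move."""
    raise NotImplementedError("wire up your LLM client here")


def evaluate(env, episodes: int = 10) -> float:
    """Run multi-turn episodes and return the mean normalized score."""
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        history: list[str] = []
        terminated = truncated = False
        score = 0.0
        while not (terminated or truncated):
            action = query_model(obs, history)   # the model chooses a move
            history += [obs, action]             # keep the dialogue context
            obs, reward, terminated, truncated, _ = env.step(action)
            score += reward
        total += score
    return total / episodes                      # mean score in [0, 1]
```

Once `query_model` is wired to a real model, `evaluate(NumberGuessEnv())` yields a per-game normalized score, which is what makes averaging and comparison across many heterogeneous games meaningful.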