AI Summary
Existing LLM evaluation methods are largely confined to static, single-domain tasks and therefore fail to comprehensively assess general-purpose reasoning capabilities. To address this, we propose KORGym, the first dynamic, gamified evaluation platform to integrate orthogonal knowledge design, multimodal interaction, and reinforcement learning (RL) paradigms. It comprises more than fifty text- and vision-based reasoning games and supports multi-turn dialogue and RL-driven scenarios. Built on Gymnasium, KORGym combines a multimodal game engine, standardized RL environment interfaces, response parsers, and a normalized evaluation protocol. Extensive experiments on 19 LLMs and 8 VLMs show that KORGym uncovers consistent reasoning behavior within model families, confirms the advantage of closed-source models, and quantifies the marginal effects of modality type, policy selection, and response length on performance. The platform substantially improves cross-model comparability and broadens the dimensions along which reasoning can be evaluated.
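To make the architecture concrete, here is a minimal sketch of what one KORGym-style text game could look like behind Gymnasium's standard `reset`/`step` API. The game (`GuessNumberEnv`), its normalized reward scheme, and the regex-based response parser are illustrative assumptions for this sketch, not KORGym's actual implementation.

```python
import re

import gymnasium as gym
from gymnasium import spaces


class GuessNumberEnv(gym.Env):
    """Hypothetical multi-turn text game: find a hidden number from hints."""

    def __init__(self, low: int = 1, high: int = 100, max_turns: int = 8):
        self.low, self.high, self.max_turns = low, high, max_turns
        # Observations and actions are free-form text, matching an LLM's I/O.
        self.observation_space = spaces.Text(max_length=256)
        self.action_space = spaces.Text(max_length=64)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        self._target = int(self.np_random.integers(self.low, self.high + 1))
        self._turn = 0
        return f"Guess a number between {self.low} and {self.high}.", {}

    def step(self, action: str):
        self._turn += 1
        truncated = self._turn >= self.max_turns
        guess = self._parse(action)
        if guess is None:
            return "Reply with a single integer.", 0.0, False, truncated, {}
        if guess == self._target:
            # Assumed normalized reward in [0, 1]: fewer turns scores higher.
            reward = 1.0 - (self._turn - 1) / self.max_turns
            return "Correct!", reward, True, False, {}
        hint = "higher" if guess < self._target else "lower"
        return f"Wrong, go {hint}.", 0.0, False, truncated, {}

    @staticmethod
    def _parse(text: str):
        # Response parser: take the last integer in the model's free-form reply.
        matches = re.findall(r"-?\d+", text)
        return int(matches[-1]) if matches else None
```

Framing each game this way is what lets a single evaluation harness, or an RL training loop, drive every game through the same interface.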
Abstract
Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
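The interactive, multi-turn assessment described above reduces to a standard agent-environment loop. Below is a hedged sketch of such a loop over an environment like the one sketched earlier; `query_llm` is a stand-in for any chat-completion call, and scoring by mean episode reward is an assumption for illustration, not KORGym's exact protocol.

```python
# Hypothetical multi-turn evaluation loop; `query_llm` stands in for any
# chat-completion API, and mean episode reward is an assumed scoring rule.
def evaluate(env, query_llm, episodes: int = 10) -> float:
    total = 0.0
    for seed in range(episodes):
        obs, _ = env.reset(seed=seed)
        history = [{"role": "user", "content": obs}]
        terminated = truncated = False
        while not (terminated or truncated):
            reply = query_llm(history)  # model's free-form text action
            obs, reward, terminated, truncated, _ = env.step(reply)
            history += [{"role": "assistant", "content": reply},
                        {"role": "user", "content": obs}]
            total += reward
    return total / episodes  # normalized per-episode score
```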