🤖 AI Summary
Existing code generation benchmarks overemphasize natural language understanding and reasoning while neglecting executable correctness, runtime performance, and multilingual support, particularly for non-English languages such as Russian. As a result, they fail to reflect real-world production capabilities and risks. Method: We propose the first unified, practice-oriented evaluation framework for multilingual code generation, covering eight programming languages and eleven task categories, with explicit emphasis on code executability and real-world performance. We introduce a skill-based taxonomy for code assessment and develop an open-source, multi-environment automated evaluation platform featuring RESTful APIs, model testing pipelines, and a dynamic leaderboard. Contribution/Results: Experiments reveal substantial limitations of leading proprietary and open-weight LLMs in Russian code generation. All resources, including benchmarks, evaluation tools, and results, are publicly released to advance standardized, reproducible research in code generation evaluation.
📝 Abstract
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks and overlook code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in our understanding of the true capabilities and risks of these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically designed to evaluate code generation by the latest LLMs in Russian. The benchmark includes 11 evaluation tasks spanning 8 programming languages. Our evaluation methodology features a taxonomy that outlines the practical coding skills models need to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform with a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations on practical coding tasks in non-English languages. We publicly release MERA Code to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.