MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation benchmarks overemphasize natural language understanding and reasoning while neglecting executable correctness, runtime performance, and multilingual support (particularly for non-English languages such as Russian), and thus fail to reflect real-world production capabilities and risks. Method: We propose the first unified, practical programming-oriented evaluation framework for multilingual code generation, covering eight programming languages and eleven task categories, with explicit emphasis on code executability and real-world performance. We introduce a novel skill-based taxonomy for code assessment and develop an open-source, multi-environment automated evaluation platform featuring RESTful APIs, model testing pipelines, and a dynamic leaderboard. Contribution/Results: Experiments reveal substantial limitations of leading proprietary and open-weight LLMs in Russian code generation. All resources, including benchmarks, evaluation tools, and results, are publicly released to advance standardized, reproducible research in code generation evaluation.

📝 Abstract
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
Problem

Research questions and friction points this paper is trying to address.

Evaluating code quality in LLMs beyond natural language tasks
Assessing real-world performance of code generation in multiple languages
Addressing gaps in understanding LLM capabilities for non-English coding
Innovation

Methods, ideas, or system contributions that make the work stand out.

MERA Code benchmark for Russian code generation
11 tasks across 8 programming languages
Open-source scoring system and leaderboard platform
👥 Authors
Artem Chervyakov
SberAI
Artificial Intelligence
Alexander Kharitonov
Natural Language Processing Researcher
Natural Language Processing, Artificial Intelligence
Pavel Zadorozhny
Unknown affiliation
Adamenko Pavel
SberAI
Rodion Levichev
Unknown affiliation
Dmitrii Vorobev
SberAI
Dmitrii Salikhov
SberAI
Aidar Valeev
SberAI
Alena Pestova
MWS AI
Maria Dziuba
ITMO University, MWS AI
Ilseyar Alimova
Kazan Federal University, Higher School of ITIS
Artem Zavgorodnev
T-Technologies
Aleksandr Medvedev
T-Technologies
Stanislav Moiseev
T-Technologies
computer science, AI, mathematics
Elena Bruches
Senior Lecturer, Novosibirsk State University
natural language processing
Daniil Grebenkin
T-Technologies, Siberian Neuronets
Roman Derunets
T-Technologies, Siberian Neuronets
Vikulov Vladimir
Rostelecom
Anton Emelyanov
SberAI
Dmitrii Babaev
SberAI
Vladimir V. Ivanov
Innopolis University
Valentin Malykh
MTS AI / ITMO University
Artificial Intelligence, Natural Language Understanding, Natural Language Processing, Dialog Systems
Alena Fenogenova
SberAI