CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

📅 2024-08-23
🏛️ arXiv.org
📈 Citations: 3
Influential: 0

🤖 AI Summary
Existing code benchmarks such as HumanEval show a strong programming language bias (over 95% of code generation benchmarks are dominated by Python) and mostly assess code generation, leaving code reasoning (predicting the output for a given input, or an input for a given output) under-evaluated, so LLMs' capabilities in other languages remain largely unknown. To address this, the authors propose CRUXEVAL-X, a multilingual code reasoning benchmark covering 19 programming languages, with at least 600 subjects per language and 19K content-consistent tests in total. Its main contributions are: (1) a benchmark that targets both the language bias and the task bias of existing benchmarks; (2) transition rules between language pairs that bridge barriers such as dynamic versus static type systems (e.g., Python versus C++); and (3) a fully automated, test-guided construction pipeline that iteratively generates and repairs translations based on execution feedback. An evaluation of 24 LLMs reveals correlations between language pairs (e.g., a significant positive correlation between TypeScript and JavaScript, and less correlation between Racket and other languages) and shows that even a model trained solely on Python reaches at most 34.4% Pass@1 in other languages, which the authors present as evidence of cross-language generalization.
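The two code reasoning directions mentioned above can be illustrated with a minimal sketch. It assumes that a prediction counts as correct when executing the function confirms it, and that Pass@1 is computed from one prediction per problem; the helper names and the toy function `f` are hypothetical and do not come from the benchmark.

```python
from typing import Any, Callable, Sequence


def check_output_prediction(f: Callable[[Any], Any], given_input: Any,
                            predicted_output: Any) -> bool:
    """Output prediction: the model sees f and an input and predicts the output."""
    return f(given_input) == predicted_output


def check_input_prediction(f: Callable[[Any], Any], predicted_input: Any,
                           given_output: Any) -> bool:
    """Input prediction: the model sees f and an output and proposes an input.

    Any input that reproduces the output is accepted (an assumption of this
    sketch); an input that makes f raise counts as a failure.
    """
    try:
        return f(predicted_input) == given_output
    except Exception:
        return False


def pass_at_1(results: Sequence[bool]) -> float:
    """Fraction of problems solved, assuming one prediction per problem."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


# Hypothetical toy subject, not taken from the benchmark.
def f(s: str) -> str:
    return s[::-1].upper()


results = [
    check_output_prediction(f, "abc", "CBA"),     # correct prediction
    check_output_prediction(f, "abc", "ABC"),     # wrong prediction
    check_input_prediction(f, "dlrow", "WORLD"),  # valid input for the output
]
print(f"Pass@1 = {pass_at_1(results):.3f}")
```

In a multilingual setting the same check would be run against the translated function in each target language, which is why the tests need to stay content-consistent across languages.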

📝 Abstract
Code benchmarks such as HumanEval are widely adopted to evaluate the coding capabilities of Large Language Models (LLMs). However, existing code benchmarks have a programming language bias that cannot be ignored: over 95% of code generation benchmarks are dominated by Python, leaving LLMs' capabilities in other programming languages such as Java and C/C++ unknown. Coding task bias also matters. Most benchmarks focus on code generation, while benchmarks for code reasoning (given an input, reason about the output; given an output, reason about the input), an essential coding capability, are insufficient. Yet constructing multilingual benchmarks can be expensive and labor-intensive, and code from contest websites such as LeetCode suffers from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multilingual code reasoning benchmark that covers 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. The construction pipeline of CRUXEVAL-X works in a fully automated and test-guided manner, iteratively generating and repairing code based on execution feedback. To cross language barriers (e.g., dynamic versus static type systems in Python and C++), we formulate various transition rules between language pairs to facilitate translation. Our extensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket has less correlation with other languages. More interestingly, even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.
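One way to read the language-pair correlation is as a correlation between per-model scores in two languages. The sketch below assumes Pearson correlation, which the abstract does not specify, and uses made-up placeholder scores that are not results from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder Pass@1 scores for four hypothetical models (NOT paper results).
# Each list holds one score per model, in the same model order.
scores = {
    "TypeScript": [30.0, 42.0, 55.0, 61.0],
    "JavaScript": [31.0, 44.0, 54.0, 63.0],
    "Racket": [20.0, 18.0, 27.0, 22.0],
}

for lang in ("JavaScript", "Racket"):
    r = pearson(scores["TypeScript"], scores[lang])
    print(f"TypeScript vs {lang}: r = {r:.2f}")
```

With real data, each list would hold the scores of the 24 evaluated models for one language, and the full set of pairwise values would form a 19 by 19 correlation matrix.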
Problem

Research questions and friction points this paper is trying to address.

Addresses programming language bias in code benchmarks
Fills gap in multilingual code reasoning evaluation
Automates benchmark construction to avoid data contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated test-guided multilingual benchmark construction
Formulated transition rules for cross-language translation
Evaluated 24 LLMs for cross-language generalization
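
The test-guided construction idea listed above (generate, execute, feed back, repair) can be sketched as a small loop. This is an illustration under stated assumptions, not the authors' implementation: `generate` and `run_tests` are hypothetical callables, the round limit is arbitrary, and the toy stubs work on Python source so that the example runs on its own, whereas a real pipeline would compile and execute code in the target language.

```python
from typing import Callable, Optional


def translate_with_repair(
    source: str,
    generate: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate a candidate, run its tests, and retry with the error as feedback.

    Returns the first candidate that passes, or None if every round fails.
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(source, feedback)
        feedback = run_tests(candidate)  # None means all tests passed
        if feedback is None:
            return candidate
    return None


# --- Toy stubs (hypothetical) so the loop can be exercised end to end ---

def toy_generate(source: str, feedback: Optional[str]) -> str:
    # First attempt is deliberately wrong; the "repair" appears once feedback exists.
    if feedback is None:
        return "def f(x):\n    return x + 1\n"
    return "def f(x):\n    return x * 2\n"


def toy_run_tests(candidate: str) -> Optional[str]:
    scope: dict = {}
    try:
        exec(candidate, scope)  # execute the candidate definition
        assert scope["f"](3) == 6, "f(3) should be 6"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


result = translate_with_repair("def f(x): return x * 2", toy_generate, toy_run_tests)
print(result)
```

Bounding the number of rounds keeps the loop from running indefinitely on subjects that cannot be translated; such subjects would simply be dropped, which is one plausible reason the per-language subject counts can differ.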