🤖 AI Summary
Existing code benchmarks (e.g., HumanEval) suffer from a strong programming language bias (over 95% of code generation benchmarks are dominated by Python) and mostly assess code generation, neglecting code reasoning, i.e., predicting the output for a given input or an input for a given output. As a result, LLMs' capabilities in other languages remain poorly characterized. To address this, we propose CRUXEVAL-X, a multilingual code reasoning benchmark covering 19 languages, with at least 600 subjects per language and 19K content-consistent tests in total. Its key features are: (1) it targets both programming language bias and coding task bias; (2) transition rules between language pairs that bridge differences such as dynamic versus static type systems; and (3) a fully automated, test-guided pipeline that iteratively generates and repairs translations based on execution feedback. Evaluation of 24 representative LLMs reveals correlations between language pairs (e.g., a significant positive correlation between TypeScript and JavaScript, and a weaker correlation between Racket and other languages) and shows that even a model trained solely on Python can reach up to 34.4% Pass@1 in other languages, indicating a degree of cross-language generalization.
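The test-guided loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `translate_with_repair`, the `generate` and `execute` callables, and the `max_rounds` limit are all hypothetical placeholders for whatever model call and test runner an actual pipeline would use.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, assumed for illustration only:
#   generate(source, feedback) -> candidate translation of `source`
#   execute(candidate)         -> (tests passed?, feedback message)
Generate = Callable[[str, Optional[str]], str]
Execute = Callable[[str], Tuple[bool, str]]


def translate_with_repair(
    source: str,
    generate: Generate,
    execute: Execute,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate whose tests pass, or None if none does."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # First round: fresh translation. Later rounds: repair using feedback.
        candidate = generate(source, feedback)
        passed, message = execute(candidate)
        if passed:
            return candidate
        # Carry the execution feedback into the next generation round.
        feedback = message
    return None
```

In this sketch, a subject that never passes within `max_rounds` is simply dropped (the function returns `None`); how the actual pipeline handles such cases is not stated in the text above.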
📝 Abstract
Code benchmarks such as HumanEval are widely adopted to evaluate the coding capabilities of Large Language Models (LLMs). However, existing code benchmarks exhibit a non-negligible programming language bias: over 95% of code generation benchmarks are dominated by Python, leaving LLMs' capabilities in other programming languages such as Java and C/C++ unknown. Coding task bias is a further concern. Most benchmarks focus on code generation, while benchmarks for code reasoning (predicting the output for a given input, and predicting an input for a given output), an essential coding capability, are scarce. Yet constructing multilingual benchmarks can be expensive and labor-intensive, and code from contest websites such as LeetCode suffers from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multilingual code reasoning benchmark that covers 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. The construction pipeline of CRUXEVAL-X is fully automated and test-guided: it iteratively generates and repairs translations based on execution feedback. To cross language barriers (e.g., dynamic versus static type systems in Python and C++), we formulate transition rules between language pairs to facilitate translation. Our extensive evaluation of 24 representative LLMs reveals correlations between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket is less correlated with other languages. More interestingly, even a model trained solely on Python can achieve up to 34.4% Pass@1 in other languages, demonstrating the cross-language generalization of LLMs.
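To make the two code reasoning tasks concrete, the sketch below scores an output prediction and an input prediction against a toy function. The function `f`, the helper names, and the scoring convention (an input prediction counts as correct if it reproduces the given output, and Pass@1 is the fraction of correct single-sample predictions) are assumptions made for illustration; they are not taken from the benchmark itself.

```python
def f(nums):
    # Toy subject function, made up for this example (not from the benchmark).
    return sorted(set(nums))


def output_prediction_correct(func, given_input, predicted_output):
    # Output prediction: the model sees the input and predicts the output.
    return func(given_input) == predicted_output


def input_prediction_correct(func, predicted_input, given_output):
    # Input prediction: the model sees the output and proposes an input.
    # Assumption: any input that reproduces the output counts as correct.
    return func(predicted_input) == given_output


def pass_at_1(results):
    # With one sample per problem, Pass@1 is the fraction answered correctly.
    return sum(results) / len(results)


results = [
    output_prediction_correct(f, [3, 1, 3], [1, 3]),     # correct
    output_prediction_correct(f, [3, 1, 3], [3, 1, 3]),  # wrong: ignores dedup
    input_prediction_correct(f, [3, 1], [1, 3]),         # correct
    input_prediction_correct(f, [2, 2], [2]),            # correct
]
score = pass_at_1(results)  # 0.75
```

Because both checks only execute the function and compare values, the same scoring applies to any language for which the subject function and its test can be run.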