ML2B: Multi-Lingual ML Benchmark For AutoML

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing machine learning (ML) code generation benchmarks are heavily English-centric, failing to address the global demand for multilingual ML development. To bridge this gap, we introduce ML2B—the first multilingual benchmark for ML code generation—covering 13 languages and 30 Kaggle competition tasks across text, image, and tabular modalities. ML2B comprises a human-verified multilingual translation dataset, a cross-lingual performance analysis framework, an end-to-end automated evaluation pipeline (AIDE), and a structured metadata annotation scheme. Experimental results reveal a significant 15–45% performance degradation on non-English tasks, underscoring the critical challenge of multilingual representation learning in code generation. We open-source the full benchmark, toolchain, and evaluation protocol to support reproducible research and advance multilingual AutoML.

📝 Abstract
Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multilingual ML code generation capabilities
Addressing performance gaps in non-English ML tasks
Providing a benchmark for cross-lingual AutoML assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual ML benchmark with 13 languages
Automated framework for pipeline assessment
Reveals performance gaps in non-English tasks
Ekaterina Trofimova
Junior Research Fellow, HSE University
Zosia Shamina
HSE University
Maria Selifanova
HSE University
Artem Zaitsev
HSE University
Remi Savchuk
HSE University
Maxim Minets
HSE University
Daria Ozerova
HSE University
Emil Sataev
HSE University
Denis Zuenko
HSE University
Andrey E. Ustyuzhanin
Constructor University, Bremen, Germany; National University of Singapore, Singapore