ML2B: Multi-Lingual ML Benchmark For AutoML

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing machine learning (ML) code generation benchmarks are heavily English-centric, failing to address the global demand for multilingual ML development. To bridge this gap, we introduce ML2B—the first multilingual benchmark for ML code generation—covering 13 languages and 30 Kaggle competition tasks across text, image, and tabular modalities. ML2B comprises a human-verified multilingual translation dataset, a cross-lingual performance analysis framework, an end-to-end automated evaluation pipeline (AIDE), and a structured metadata annotation scheme. Experimental results reveal a significant 15–45% performance degradation on non-English tasks, underscoring the critical challenge of multilingual representation learning in code generation. We open-source the full benchmark, toolchain, and evaluation protocol to support reproducible research and advance multilingual AutoML.

📝 Abstract
Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multilingual ML code generation capabilities
Addressing performance gaps in non-English ML tasks
Providing a benchmark for cross-lingual AutoML assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual ML benchmark with 13 languages
Automated framework for pipeline assessment
Reveals performance gaps in non-English tasks
Ekaterina Trofimova
Junior Research Fellow, HSE University
Zosia Shamina
HSE University
Maria Selifanova
HSE University
Artem Zaitsev
HSE University
Remi Savchuk
HSE University
Maxim Minets
HSE University
Daria Ozerova
HSE University
Emil Sataev
HSE University
Denis Zuenko
HSE University
Andrey E. Ustyuzhanin
Constructor University, Bremen, Germany; National University of Singapore, Singapore