CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the capability of large language models (LLMs) to generate cross-language code, particularly interoperable code that relies on inter-process communication and other cross-language mechanisms. Method: The authors introduce CrossPL, the first dedicated benchmark for this task, covering six mainstream programming languages and seven cross-language interoperability techniques, with 1,982 tasks derived from real-world multilingual repositories. The construction pipeline combines 156 hand-crafted finite-state machines (applied to 19,169 multi-language GitHub repositories) with LLM-driven automation for task extraction, instruction generation, and functional validation. Contribution/Results: A systematic evaluation of 20 state-of-the-art LLMs (14 general-purpose and 6 code-oriented) reveals significant performance limitations on cross-language interoperability tasks, highlighting a critical research gap. CrossPL is publicly released as a reproducible, extensible evaluation infrastructure for future work.

📝 Abstract
As large language models (LLMs) become increasingly embedded in software engineering workflows, a critical capability remains underexplored: generating correct code that enables cross-programming-language (CPL) interoperability. This skill is essential for building complex systems that integrate components written in multiple languages via mechanisms like inter-process communication (IPC). To bridge this gap, we present CrossPL, the first benchmark designed to systematically evaluate LLMs' ability to generate CPL-interoperating code. CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used programming languages and seven representative CPL techniques. We construct this benchmark by (i) analyzing 19,169 multi-language GitHub repositories using 156 hand-crafted finite state machines (FSMs), and (ii) developing an LLM-based pipeline that automatically extracts CPL code snippets, generates task instructions, and validates functional correctness. We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation. Results reveal that even the best-performing models struggle with CPL scenarios, underscoring the need for more targeted research in this space. Our benchmark and code are available at: https://anonymous.4open.science/r/crosspl-2814.
Problem

Research questions and friction points this paper is trying to address.

How well can LLMs generate cross-programming-language (CPL) code?
Can LLMs produce correct CPL-interoperating code built on IPC mechanisms?
No existing benchmark measures LLMs' ability to integrate components written in multiple languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

First dedicated benchmark (CrossPL) for CPL-interoperating code generation
Hand-crafted finite state machines drive both task mining and functional validation
LLM-based pipeline automatically extracts CPL snippets, generates instructions, and checks correctness
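The summary does not reproduce the paper's actual FSM definitions, but the core validation idea can be sketched as follows: a finite state machine accepts only event traces that realize a valid cross-language interaction. All state and event names below are hypothetical, chosen to illustrate an IPC-style handshake, not taken from CrossPL.

```python
# Minimal sketch of FSM-based validation: the FSM encodes the ordered
# steps a correct cross-language IPC exchange must perform. An event
# trace (e.g. extracted from running generated code) passes validation
# only if it drives the FSM from the start state to the accept state.

# (state, event) -> next state; any unlisted pair is a rejection.
VALID_TRANSITIONS = {
    ("start", "open_socket"): "connected",
    ("connected", "send_request"): "awaiting_reply",
    ("awaiting_reply", "recv_reply"): "done",
    ("done", "close_socket"): "closed",
}


def accepts(events):
    """Return True if the event trace reaches the accepting 'closed' state."""
    state = "start"
    for event in events:
        nxt = VALID_TRANSITIONS.get((state, event))
        if nxt is None:  # illegal step for the current state
            return False
        state = nxt
    return state == "closed"
```

For example, `accepts(["open_socket", "send_request", "recv_reply", "close_socket"])` succeeds, while a trace that closes the socket before receiving a reply is rejected. The real pipeline presumably uses far richer FSMs (one per interoperability technique), but the accept/reject mechanics are the same.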