JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)

📅 2026-02-10

🤖 AI Summary

This study evaluates how well large language models (LLMs) handle deprecated APIs when migrating Java 8 code to Java 11. It introduces JMigBench, a benchmark covering eight categories of deprecated APIs that distinguishes straightforward one-to-one replacements from complex migration scenarios. The benchmark is built on a refined dataset of function pairs, constructed after an initial collection from open-source repositories proved limited in quality. Mistral Codestral is evaluated automatically using CodeBLEU and keyword-based metrics. Codestral handles simple one-to-one replacements with moderate success, producing identical migrations in 11.11% of cases, but struggles with complex cases involving technologies such as CORBA and JAX-WS. This indicates that the model can assist with repetitive migration tasks but cannot yet replace human developers. The work provides a quantifiable framework for assessing LLMs’ code migration capabilities.
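To make the "one-to-one replacement" category concrete, here is a minimal sketch of what such a migration can look like. The specific API pair is an assumption chosen for illustration: the source does not say whether JMigBench includes it. The class and method names (`Base64Migration`, `encode`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Migration {
    // Java 8 style (illustrative only; the java.xml.bind module was removed
    // from the JDK in Java 11, so this line does not compile on a stock JDK 11):
    //   return javax.xml.bind.DatatypeConverter.printBase64Binary(data);

    // Java 11 style: java.util.Base64 has shipped with the JDK since Java 8,
    // so the call can be swapped in directly.
    public static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // prints aGk=
        System.out.println(encode("hi".getBytes(StandardCharsets.UTF_8)));
    }
}
```

By contrast, CORBA and JAX-WS were removed from the JDK in Java 11 without an in-JDK substitute, so migrating code that uses them typically involves adding external dependencies or restructuring, which is plausibly why such cases are harder than a single call swap.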

📝 Abstract

We build a benchmark to evaluate large language models (LLMs) for source code migration tasks, specifically upgrading functions from Java 8 to Java 11. We first collected a dataset of function pairs from open-source repositories, but limitations in data quality led us to construct a refined dataset covering eight categories of deprecated APIs. Using this dataset, the Mistral Codestral model was evaluated with CodeBLEU and keyword-based metrics to measure lexical and semantic similarity as well as migration correctness. Results show that the evaluated model (Mistral Codestral) can handle trivial one-to-one API substitutions with moderate success, achieving identical migrations in 11.11% of the cases, but it struggles with more complex migrations such as CORBA or JAX-WS. These findings suggest Mistral Codestral can partially reduce developer effort by automating repetitive migration tasks but cannot yet replace humans within the scope of the JMigBench benchmark. The benchmark and analysis provide a foundation for future work on expanding datasets, refining prompting strategies, and improving migration performance across different LLMs.
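The abstract does not spell out how the keyword-based metric is computed. As a rough sketch under stated assumptions, one plausible form checks that a migrated function no longer mentions removed API names and does mention the expected replacements. Everything here (the class `KeywordCheck`, the `score` method, the keyword lists, and the scoring rule) is hypothetical and is not taken from the paper.

```java
import java.util.List;

public class KeywordCheck {
    /**
     * Hypothetical scoring rule: the fraction of keyword checks that pass.
     * A "removed" keyword passes when it is absent from the migrated code;
     * an "expected" keyword passes when it is present.
     */
    public static double score(String migrated, List<String> removed, List<String> expected) {
        int total = removed.size() + expected.size();
        if (total == 0) {
            return 1.0; // nothing to check
        }
        int passed = 0;
        for (String keyword : removed) {
            if (!migrated.contains(keyword)) {
                passed++;
            }
        }
        for (String keyword : expected) {
            if (migrated.contains(keyword)) {
                passed++;
            }
        }
        return (double) passed / total;
    }

    public static void main(String[] args) {
        // Assumed model output for a Base64 migration task.
        String migrated = "return java.util.Base64.getEncoder().encodeToString(data);";
        double s = score(migrated, List.of("javax.xml.bind"), List.of("java.util.Base64"));
        System.out.println(s);
    }
}
```

A plain substring check like this is cheap to run but only looks at surface text: it cannot tell whether the migrated code compiles or behaves the same, which is one reason to pair it with a similarity metric such as CodeBLEU.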

Problem

Research questions and friction points this paper is trying to address.

code migration

large language models

Java 8 to Java 11

deprecated APIs

LLM evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

code migration

large language models

benchmark