COBOL-Coder: Domain-Adapted Large Language Models for COBOL Code Generation and Translation

📅 2026-04-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the poor performance of current large language models on COBOL code generation and translation tasks, which are critical for legacy business systems. The authors propose a high-quality COBOL data construction methodology that integrates compiler-guided validation with multi-stage similarity filtering, and introduce COBOL-JavaTrans—the first benchmark for bidirectional COBOL–Java translation. By combining automated data cleaning, domain-adaptive fine-tuning, compiler feedback validation, and human evaluation, their trained model achieves a compilation success rate of 73.95% and a Pass@1 score of 49.33 on COBOLEval, substantially outperforming GPT-4o and leading open-source models. In the Java-to-COBOL direction, the model attains a Pass@1 of 34.93, and its outputs receive positive assessments from experienced COBOL developers.
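The Pass@1 scores reported above follow the standard functional-correctness metric for code generation. As a reminder of how it is computed, here is a minimal sketch of the unbiased pass@k estimator from the literature (Chen et al., 2021); the paper itself does not spell out its estimator, so this is illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples for a
    problem, of which c pass the tests, estimate the probability
    that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1, pass@1 reduces to the fraction of correct samples:
print(pass_at_k(10, 3, 1))  # 0.3
```

A benchmark-level Pass@1 (e.g. 49.33 on COBOLEval) is then the mean of this per-problem estimate across all problems.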
📝 Abstract
COBOL remains a critical language for mainframe systems, yet existing large language models (LLMs) struggle to generate and translate COBOL code correctly. This paper reports our experience in developing and evaluating domain-adapted LLMs for COBOL and mainframe software engineering. We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated COBOL domain data. We evaluate COBOL-Coder on two tasks: code generation (on COBOLEval and COBOLCodeBench) and code translation (on COBOL-JavaTrans, our proposed benchmark for bidirectional COBOL–Java translation). In our experiments, COBOL-Coder achieves up to a 73.95% compilation success rate and 49.33 Pass@1 on COBOLEval, compared to 41.8% and 16.4 for GPT-4o, while most open-source baselines (e.g., CodeGemma, CodeLlama, StarCoder2) fail to produce compilable programs. For Java-to-COBOL translation, COBOL-Coder reaches 34.93 Pass@1, whereas general-purpose LLMs achieve near-zero scores. To assess the usability of LLM-generated code in real-world settings, we conduct a survey with experienced COBOL developers. Participants consistently report that COBOL-Coder exhibits stronger COBOL awareness, has more reliable program structure, and is better aligned with enterprise practices than general-purpose LLMs.
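The data curation pipeline described in the abstract has two ingredients: compiler-guided validation (keep only samples a COBOL compiler accepts) and multi-stage similarity filtering (drop near-duplicates). A minimal sketch of those two ingredients is below; it assumes GnuCOBOL's `cobc` is on the PATH, and the similarity measure and threshold are illustrative placeholders, not the paper's actual choices.

```python
import os
import subprocess
import tempfile
from difflib import SequenceMatcher

def compiles(cobol_source: str) -> bool:
    """Compiler-guided validation: return True if a COBOL compiler
    accepts the sample. Assumes GnuCOBOL (`cobc`) is installed; the
    paper does not specify which compiler its pipeline uses."""
    with tempfile.NamedTemporaryFile("w", suffix=".cbl", delete=False) as f:
        f.write(cobol_source)
        path = f.name
    try:
        result = subprocess.run(
            ["cobc", "-fsyntax-only", path],
            capture_output=True, timeout=30,
        )
        return result.returncode == 0
    finally:
        os.unlink(path)

def similarity_filter(samples: list[str], threshold: float = 0.9) -> list[str]:
    """One similarity-filtering stage: keep a sample only if it is
    not too similar to anything already kept. Threshold and metric
    (difflib ratio) are hypothetical stand-ins."""
    kept: list[str] = []
    for s in samples:
        if all(SequenceMatcher(None, s, k).ratio() < threshold for k in kept):
            kept.append(s)
    return kept
```

In the paper's pipeline several such filtering stages are chained (hence "multi-stage"); compiler validation can run before or after filtering, and running it first avoids spending similarity comparisons on samples that would be discarded anyway.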
Problem

Research questions and friction points this paper is trying to address.

COBOL
code generation
code translation
large language models
mainframe systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

domain-adapted LLM
COBOL code generation
compiler-guided validation
code translation
data curation pipeline
🔎 Similar Papers
2024-03-25 · 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (FORGE) · Citations: 22