🤖 AI Summary
This work addresses the poor performance of current large language models on COBOL code generation and translation, tasks that are critical for maintaining legacy business systems. The authors propose a data construction methodology that integrates compiler-guided validation with multi-stage similarity filtering to produce high-quality COBOL training data, and introduce COBOL-JavaTrans, a benchmark for bidirectional COBOL–Java translation. The resulting fine-tuned model, COBOL-Coder, achieves a compilation success rate of 73.95% and a Pass@1 score of 49.33 on COBOLEval, substantially outperforming GPT-4o (41.8% and 16.4) and leading open-source models, most of which fail to produce compilable programs. In the Java-to-COBOL direction, the model attains a Pass@1 of 34.93, whereas general-purpose LLMs score near zero. In a separate survey, experienced COBOL developers rated its output more favorably than that of general-purpose LLMs.
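To make the data construction idea concrete, the following is a minimal sketch of a filter that combines duplicate removal with a compile check. It is not the authors' pipeline: the function names, the stage order, the `difflib`-based similarity measure, and the 0.9 threshold are all assumptions chosen for illustration. The compile check is passed in as a callable so the sketch runs without a COBOL toolchain; in practice it could wrap a real compiler such as GnuCOBOL's `cobc -fsyntax-only`.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


def normalize(source: str) -> str:
    """Collapse whitespace and case so formatting differences do not hide duplicates.

    Upper-casing is a crude assumption: it also alters string literals,
    which is acceptable for duplicate detection but not for compilation.
    """
    return " ".join(source.upper().split())


def curate(
    programs: Iterable[str],
    compiles: Callable[[str], bool],
    max_similarity: float = 0.9,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Keep programs that are novel and pass the supplied compile check."""
    kept: List[str] = []
    kept_normalized: List[str] = []
    seen = set()
    for source in programs:
        norm = normalize(source)
        # Stage 1: drop exact duplicates (after normalization).
        if norm in seen:
            continue
        seen.add(norm)
        # Stage 2: drop near-duplicates of programs already kept.
        if any(
            SequenceMatcher(None, norm, other).ratio() >= max_similarity
            for other in kept_normalized
        ):
            continue
        # Stage 3: compiler-guided validation on the original source text.
        if not compiles(source):
            continue
        kept.append(source)
        kept_normalized.append(norm)
    return kept
```

The pairwise comparison in stage 2 is quadratic in the number of kept programs, so a corpus of realistic size would likely need an approximate method such as MinHash instead.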
📝 Abstract
COBOL remains a critical language for mainframe systems, yet existing large language models (LLMs) struggle to generate and translate COBOL code correctly. This paper reports our experience in developing and evaluating domain-adapted LLMs for COBOL and mainframe software engineering. We introduce (1) an automated data curation pipeline that combines compiler-guided validation with multi-stage similarity-based filtering to construct high-quality COBOL training data, and (2) COBOL-Coder, a COBOL-specialized LLM fine-tuned on the curated COBOL domain data. We evaluate COBOL-Coder on two tasks: code generation (on COBOLEval and COBOLCodeBench) and code translation (on COBOL-JavaTrans, our proposed benchmark for bidirectional COBOL-Java translation). In our experiments, COBOL-Coder achieves up to a 73.95 percent compilation success rate and 49.33 Pass@1 on COBOLEval, compared to 41.8 percent and 16.4 for GPT-4o, while most open-source baselines (e.g., CodeGemma, CodeLlama, StarCoder2) fail to produce compilable programs. For Java-to-COBOL translation, COBOL-Coder reaches 34.93 Pass@1, whereas general-purpose LLMs achieve near-zero scores. To assess the usability of LLM-generated code in real-world settings, we conduct a survey with experienced COBOL developers. Participants consistently report that COBOL-Coder exhibits stronger COBOL awareness, produces more reliable program structure, and is better aligned with enterprise practices than general-purpose LLMs.
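For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute them. It uses the standard unbiased pass@k estimator, which for k = 1 reduces to the fraction of samples that pass all tests. Whether the paper uses exactly this formulation is an assumption, and the function names and the `(samples, compiled, passed)` input layout are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all tests, k: sample budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_scores(
    results: List[Tuple[int, int, int]], k: int = 1
) -> Tuple[float, float]:
    """Return (compilation success rate, pass@k), both as percentages.

    Each entry is (samples, compiled, passed) for one benchmark problem.
    """
    total_samples = sum(n for n, _, _ in results)
    total_compiled = sum(compiled for _, compiled, _ in results)
    compile_rate = 100.0 * total_compiled / total_samples
    mean_pass = sum(pass_at_k(n, passed, k) for n, _, passed in results) / len(results)
    return compile_rate, 100.0 * mean_pass
```

Under this formulation a program that fails to compile can never pass, so Pass@1 is bounded above by the compilation success rate, consistent with the gap between the two figures reported above.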