🤖 AI Summary
This work addresses the lack of systematic evaluation benchmarks for large language models on low-resource general-purpose programming languages, as existing efforts predominantly focus on domain-specific languages. We present CangjieBench, the first contamination-free benchmark for the Cangjie programming language, comprising 248 high-quality human-translated samples spanning text-to-code and code-to-code tasks. We systematically evaluate prominent models under four paradigms: direct generation, syntax-constrained generation, retrieval-augmented generation (RAG), and agent-based approaches. Our experiments reveal that syntax-constrained generation achieves the best trade-off between accuracy and computational cost, while agent-based methods yield the highest accuracy at substantial computational expense. Notably, code-to-code tasks consistently underperform, exposing negative transfer caused by overfitting to source-language patterns.
📝 Abstract
Large Language Models (LLMs) excel in high-resource programming languages but struggle with low-resource ones. Existing research on low-resource programming languages primarily focuses on Domain-Specific Languages (DSLs), leaving data-scarce general-purpose languages underexplored. To address this gap, we introduce CangjieBench, a contamination-free benchmark for Cangjie, a representative low-resource general-purpose language. The benchmark comprises 248 high-quality samples manually translated from HumanEval and ClassEval, covering both Text-to-Code and Code-to-Code tasks. We conduct a systematic evaluation of diverse LLMs under four settings: Direct Generation, Syntax-Constrained Generation, Retrieval-Augmented Generation (RAG), and Agent. Experiments reveal that Direct Generation performs poorly, whereas Syntax-Constrained Generation offers the best trade-off between accuracy and computational cost. The Agent setting achieves state-of-the-art accuracy but incurs high token consumption. Furthermore, we observe that Code-to-Code translation often underperforms Text-to-Code generation, suggesting a negative transfer phenomenon in which models overfit to source-language patterns. We hope our work offers valuable insights into LLM generalization to unseen and low-resource programming languages. Our code and data are available at https://github.com/cjhCoder7/CangjieBench.