AI Summary
This work addresses GPT-5's notably weaker programming performance in the low-resource functional language Idris compared with mainstream languages, a gap attributed primarily to its inability to exploit local compilation and test feedback. To bridge this gap, the authors propose an iterative, compiler-feedback-driven prompting strategy that integrates structured compiler errors and test-failure information into the large language model's reasoning process. By combining error-type-guided refinement, documentation-augmented prompts, and systematic feedback integration, the approach enables the model to self-correct adaptively. Evaluated on the Exercism platform, this method raises GPT-5's Idris exercise pass rate from 22 out of 56 to 54 out of 56, approaching its performance in mainstream languages and substantially narrowing the capability gap for low-resource programming languages.
Abstract
GPT-5, a state-of-the-art large language model from OpenAI, demonstrates strong performance in widely used programming languages such as Python, C++, and Java; however, its ability to operate in low-resource or less commonly used languages remains underexplored. This work investigates whether GPT-5 can effectively acquire proficiency in an unfamiliar functional programming language, Idris, through iterative, feedback-driven prompting. We first establish a baseline showing that with zero-shot prompting the model solves only 22 out of 56 Idris exercises from the Exercism platform, substantially underperforming relative to higher-resource languages (45 out of 50 in Python and 35 out of 47 in Erlang). We then evaluate several refinement strategies: iterative prompting based on platform feedback, augmenting prompts with documentation and error-classification guides, and iterative prompting using local compilation errors and failed test cases. Among these approaches, incorporating local compilation errors yields the most substantial improvement. Using this structured, error-guided refinement loop, GPT-5's performance rises to 54 solved problems out of 56. These results suggest that while large language models may initially struggle in low-resource settings, structured compiler-level feedback can play a critical role in unlocking their capabilities.
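The error-guided refinement loop described above can be sketched as follows. This is a minimal illustration, not the authors' actual harness: the function names (`refine`, `ask_llm`, `compile_and_test`) and the round limit are assumptions, and the demonstration swaps in toy stubs for the LLM and the Idris compiler.

```python
from typing import Callable, Optional, Tuple

def refine(task: str,
           ask_llm: Callable[[str], str],
           compile_and_test: Callable[[str], Tuple[bool, str]],
           max_rounds: int = 5) -> Optional[str]:
    """Iteratively prompt until the candidate solution compiles and passes tests.

    Each failed round appends the structured compiler error or failing-test
    output to the next prompt, mirroring the feedback-integration step
    described in the abstract.
    """
    prompt = task
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        ok, feedback = compile_and_test(code)
        if ok:
            return code
        prompt = (f"{task}\n\nPrevious attempt:\n{code}\n\n"
                  f"Compiler/test feedback:\n{feedback}\nPlease fix the solution.")
    return None  # gave up after max_rounds

# Toy demonstration with stubbed components (no real compiler or LLM):
attempts = iter(["bad code", "good code"])

def toy_llm(prompt: str) -> str:
    # Pretend the model improves once it sees the error message.
    return next(attempts)

def toy_check(code: str) -> Tuple[bool, str]:
    if code == "good code":
        return True, ""
    return False, "Error: parse error at line 1"

result = refine("write a function", toy_llm, toy_check)
print(result)  # -> good code
```

In a real setup, `compile_and_test` would invoke the Idris compiler and the exercise's test suite on the generated file and return the captured diagnostics verbatim.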