Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code

📅 2023-12-22

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

Problem: Instruction-tuned large language models (LLMs) are neither reliably correct nor robust when generating code from natural-language programming instructions. Method: Turbulence is a benchmark built from parameterized question templates, each paired with an automated test oracle. A single template yields a "neighborhood" of very similar programming questions; the model is queried on each one at two temperature settings and every answer is checked by the oracle. This pinpoints the specific parameter instantiations where a model breaks down, rather than reporting only an aggregate error rate. Contribution/Results: In experiments on five LLMs from OpenAI, Cohere, and Meta, Turbulence reveals gaps in code generation ability across the board, including anomalies where a model solves almost all questions in a neighborhood but fails on particular instantiations. The benchmark therefore serves as a fine-grained diagnostic of LLM robustness in code generation.

📝 Abstract

We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated $\textit{test oracle}$ that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including $\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$ questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting $\textit{robustness}$ issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
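The relationship between a question template, its test oracle, and a neighbourhood can be illustrated with a small sketch. This is not the Turbulence implementation: the names `QuestionTemplate`, `sum_first_n_oracle`, `neighbourhood_report`, and `fake_model` are hypothetical, the example problem is invented for illustration, and a stub function stands in for the LLM (a real harness would send the instantiated prompt to a model and execute the returned code).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Solution = Callable[[List[int]], int]


@dataclass
class QuestionTemplate:
    """Hypothetical template: a prompt with an {n} placeholder plus an oracle."""
    prompt: str
    oracle: Callable[[Solution, int], bool]

    def instantiate(self, n: int) -> str:
        # One concrete question in the neighbourhood.
        return self.prompt.format(n=n)


def sum_first_n_oracle(candidate: Solution, n: int) -> bool:
    """Judge a candidate against a reference answer on a few fixed inputs."""
    cases = [[], [5], [1, 2, 3, 4], [-1, 0, 7, 2, 9]]
    try:
        return all(candidate(xs) == sum(xs[:n]) for xs in cases)
    except Exception:
        return False  # a crashing solution counts as incorrect


def neighbourhood_report(
    template: QuestionTemplate,
    params: Iterable[int],
    solve: Callable[[int], Solution],
) -> Tuple[float, List[int]]:
    """Return the pass rate over the neighbourhood and the failing parameters."""
    params = list(params)
    failing = [n for n in params if not template.oracle(solve(n), n)]
    return (len(params) - len(failing)) / len(params), failing


def fake_model(n: int) -> Solution:
    """Stub standing in for an LLM (assumption): wrong only when n == 3."""
    if n == 3:
        return lambda xs: sum(xs[: n - 1])  # off-by-one slip
    return lambda xs: sum(xs[:n])


TEMPLATE = QuestionTemplate(
    prompt="Write a function that returns the sum of the first {n} elements of a list of integers.",
    oracle=sum_first_n_oracle,
)

rate, failing = neighbourhood_report(TEMPLATE, range(1, 6), fake_model)
print(rate, failing)
```

In this toy run the stub solves four of the five instantiations and fails only for `n = 3`, which is the shape of result the abstract calls an anomaly: almost the whole neighbourhood is solved, yet one parameter value breaks the solution.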

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Programming Tasks

Instruction Execution Accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Turbulence testing method

Programming task evaluation

Large language model assessment
