CONCUR: Benchmarking LLMs for Concurrent Code Generation

πŸ“… 2026-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing code generation benchmarks focus primarily on sequential code and are ill-equipped to evaluate large language models' capabilities in concurrent programming, which presents unique failure modes such as deadlocks and race conditions. To address this gap, this work proposes CONCUR, the first benchmark designed specifically for evaluating concurrent code generation. CONCUR comprises 115 tasks: 43 core problems derived from a standard concurrency textbook, augmented with 72 validated mutant variants that broaden structural and linguistic diversity while preserving the semantic core. A comprehensive evaluation of prominent large language models reveals significant deficiencies in their ability to generate correct concurrent code. CONCUR thus fills a critical void in existing evaluation frameworks and provides a reliable benchmark for future model development and capability assessment in concurrent programming.

πŸ“ Abstract
Leveraging Large Language Models (LLMs) for code generation has become common practice in software engineering, and benchmarks have been established to evaluate the code generation capabilities of LLMs. However, existing benchmarks focus primarily on sequential code and cannot effectively evaluate LLMs on concurrent code generation. Compared to sequential code, concurrent code is more complex and exhibits unique types of bugs, such as deadlocks and race conditions, that do not occur in sequential code. Consequently, benchmarks designed for sequential code generation are inadequate for evaluating concurrent code generation with LLMs. To address this gap, we designed CONCUR, a benchmark specifically aimed at evaluating the capability of LLMs to generate concurrent code. CONCUR consists of a base set of 43 concurrency problems derived from a standard concurrency textbook, together with 72 validated mutant variants, for 115 problems in total. The base problems serve as the semantic core of the benchmark, while the mutants expand linguistic and structural diversity. We evaluated a range of LLMs on CONCUR, highlighting limitations of current models. Overall, our work provides a novel direction for evaluating the capability of LLMs to generate code with a focus on concurrency.
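The other bug class the abstract names, a race condition, can be sketched in a few lines (again, illustrative code, not a CONCUR task): several threads perform an unsynchronized read-modify-write on a shared counter, so updates are lost unless a lock serializes the critical section. The `run_counter` helper and the sleep that widens the race window are assumptions of this sketch.

```python
import threading
import time

def run_counter(workers, lock=None):
    """Increment a shared counter from several threads; the sleep widens
    the read-modify-write window so the race fires reliably."""
    counter = 0

    def increment():
        nonlocal counter
        if lock is not None:
            with lock:                 # serialize the critical section
                tmp = counter
                time.sleep(0.01)
                counter = tmp + 1
        else:
            tmp = counter              # read
            time.sleep(0.01)           # other threads read the same stale value here
            counter = tmp + 1          # write back: earlier updates are lost

    threads = [threading.Thread(target=increment) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

racy_total = run_counter(8)                      # fewer than 8: lost updates
safe_total = run_counter(8, threading.Lock())    # exactly 8
```

Bugs of this kind are nondeterministic, which is precisely why plain output-matching harnesses built for sequential code are a poor fit for judging generated concurrent code.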
Problem

Research questions and friction points this paper is trying to address.

concurrent code generation
large language models
benchmark
code evaluation
concurrency bugs
Innovation

Methods, ideas, or system contributions that make the work stand out.

concurrent code generation
LLM benchmarking
CONCUR
race conditions
deadlock
πŸ”Ž Similar Papers
No similar papers found.