MUCOCO: Automated Consistency Testing of Code LLMs

📅 2026-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses inconsistent program behaviors in code large language models (Code LLMs), which conventional hand-crafted, static benchmarks do not target. The authors propose MUCOCO, an automated consistency testing method. MUCOCO applies semantics-preserving mutations to the program in a coding query to produce semantically equivalent variants (mutants), then flags an inconsistency when the model's behavior on a mutant differs from its behavior on the original, for example a different output or a test failure. Evaluated on four coding tasks and seven LLMs, MUCOCO exposes inconsistencies with about 15% of its generated inputs and outperforms the closest baseline, TURBULENCE.

📝 Abstract
Code LLMs often exhibit inconsistent program behaviors. Developers typically employ benchmarks to assess Code LLMs, but most benchmarks are hand-crafted and static, and do not target the consistency property. In this work, we pose the scientific question: how can we automatically discover inconsistent program behaviors in Code LLMs? To address this challenge, we propose an automated consistency testing method, called MUCOCO, which employs semantics-preserving mutation analysis to expose inconsistent behaviors in Code LLMs. Given a coding query, MUCOCO automatically transforms its program into semantically equivalent programs (aka mutants) and detects inconsistencies between the mutants and the original program (e.g., different output or test failure). We evaluate MUCOCO using four (4) coding tasks and seven (7) LLMs. Results show that MUCOCO is effective in exposing inconsistency and outperforms the closest baseline (TURBULENCE). About one in seven (15%) inputs generated by MUCOCO exposed inconsistencies. Our work motivates the need to test Code LLMs for the consistency property.
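The mutate-then-compare idea described in the abstract can be illustrated with a minimal Python sketch. This is not MUCOCO's implementation: the variable-renaming mutation, the helper names (`RenameVariables`, `mutate`, `run`, `is_inconsistent`), and the two toy "models" are all assumptions made for illustration. A real setup would query a Code LLM in place of the toy callables.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Hypothetical mutation: rename identifiers according to a mapping.

    Semantics are preserved only if the new names do not collide with
    existing ones and the code does not pass the renamed parameters by keyword.
    """

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def mutate(source, mapping):
    """Return a mutant of `source` with variables renamed (Python 3.9+)."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def run(source, func_name, *args):
    """Execute `source` and call the named function; used as ground truth."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


def is_inconsistent(model, original, mutant, func_name, args):
    """Differential check: equivalent programs should get the same answer."""
    # Sanity check that the mutation really preserved behavior on this input.
    assert run(original, func_name, *args) == run(mutant, func_name, *args)
    return model(original, args) != model(mutant, args)


ORIGINAL = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)
MUTANT = mutate(ORIGINAL, {"values": "xs", "acc": "s", "v": "item"})


def consistent_model(source, args):
    """Toy stand-in for an LLM that always predicts the true output."""
    return run(source, "total", *args)


def brittle_model(source, args):
    """Toy stand-in for an LLM that only answers when it sees the name 'acc'."""
    return run(source, "total", *args) if "acc" in source else None


if __name__ == "__main__":
    inputs = ([1, 2, 3],)
    print(is_inconsistent(consistent_model, ORIGINAL, MUTANT, "total", inputs))
    print(is_inconsistent(brittle_model, ORIGINAL, MUTANT, "total", inputs))
```

The toy `brittle_model` changes its answer after a pure renaming, which is the kind of behavioral contradiction a consistency test is meant to surface; `consistent_model` does not.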
Problem

Research questions and friction points this paper is trying to address.

Code LLMs
consistency
inconsistent behaviors
automated testing
semantic equivalence
Innovation

Methods, ideas, or system contributions that make the work stand out.

consistency testing
Code LLMs
semantic-preserving mutation
automated testing
program equivalence