MUCOCO: Automated Consistency Testing of Code LLMs

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses inconsistent program behaviors in code large language models (Code LLMs), which conventional hand-crafted, static benchmarks do not target. The authors propose MUCOCO, an automated consistency testing framework for Code LLMs. MUCOCO applies semantics-preserving mutations to the program in a coding query to produce semantically equivalent variants (mutants), then flags cases where the behavior on a mutant differs from the behavior on the original program, for example a different output or a test failure. Evaluated on four coding tasks and seven LLMs, MUCOCO exposes inconsistencies with about 15% of its generated inputs and outperforms the closest baseline, TURBULENCE.

📝 Abstract

Code LLMs often exhibit inconsistent program behaviors. Developers typically employ benchmarks to assess Code LLMs, but most benchmarks are hand-crafted and static, and they do not target the consistency property. In this work, we pose the scientific question: how can we automatically discover inconsistent program behaviors in Code LLMs? To address this challenge, we propose an automated consistency testing method, called MUCOCO, which employs semantics-preserving mutation analysis to expose inconsistent behaviors in Code LLMs. Given a coding query, MUCOCO automatically transforms its program into semantically equivalent programs (i.e., mutants) and detects inconsistencies between the mutants and the original program (e.g., a different output or a test failure). We evaluate MUCOCO using four coding tasks and seven LLMs. Results show that MUCOCO is effective in exposing inconsistency and outperforms the closest baseline (TURBULENCE). About one in seven (15%) of the inputs generated by MUCOCO exposed inconsistencies. Our work motivates the need to test Code LLMs for the consistency property.
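The mutate-then-compare loop described in the abstract can be illustrated with a minimal Python sketch. This is not MUCOCO's implementation. Variable renaming is assumed here as one example of a semantics-preserving mutation, since the abstract does not list the actual mutation operators. The names `mutate`, `find_inconsistencies`, `faithful_model`, and `brittle_model` are hypothetical, and the two "models" are toy stand-ins for a Code LLM answering an output-prediction query.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename one identifier wherever it appears in the program."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        if node.arg == self.old:
            node.arg = self.new
        return node


def mutate(source, old, new):
    """Return a mutant of `source` with `old` renamed to `new`.

    Assumption: `new` is a fresh name not already used in `source`;
    otherwise the rename would not preserve semantics.
    """
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def find_inconsistencies(model, source, mutants, test_input):
    """Return the mutants on which the model's answer differs from
    its answer on the original program (a differential check)."""
    baseline = model(source, test_input)
    return [m for m in mutants if model(m, test_input) != baseline]


def faithful_model(program, x):
    """Toy stand-in for a consistent model: it simply runs f(x)."""
    namespace = {}
    exec(program, namespace)
    return namespace["f"](x)


def brittle_model(program, x):
    """Toy stand-in for an inconsistent model: its answer depends on
    a surface feature (the identifier name), not on the semantics."""
    return faithful_model(program, x) if "total" in program else None


ORIGINAL = (
    "def f(n):\n"
    "    total = 0\n"
    "    for i in range(n):\n"
    "        total += i\n"
    "    return total\n"
)

mutant = mutate(ORIGINAL, "total", "acc")
print(len(find_inconsistencies(faithful_model, ORIGINAL, [mutant], 5)))  # 0
print(len(find_inconsistencies(brittle_model, ORIGINAL, [mutant], 5)))   # 1
```

In this sketch an inconsistency is simply a mismatch between two answers. A test-failure oracle, which the abstract also mentions, would replace the equality check with a run of the task's test suite.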

Problem

Research questions and friction points this paper is trying to address.

Code LLMs

consistency

inconsistent behaviors

automated testing

semantic equivalence

Innovation

Methods, ideas, or system contributions that make the work stand out.

consistency testing

code LLMs

semantic-preserving mutation