🤖 AI Summary
Existing benchmarks for detecting AI-generated code exhibit significant limitations under real-world challenges such as distribution shifts, multilingual settings, diverse generative models, and hybrid or adversarial code. To address this gap, this work introduces AICD Bench, a large-scale benchmark comprising two million samples spanning nine programming languages and 77 distinct models across 11 model families. It systematically incorporates novel evaluation tasks, including cross-lingual and domain-shift robustness, model-family attribution, and detection of mixed or adversarial code. By integrating code generated by a wide array of large language models and defining three core evaluation dimensions (robust binary classification, model-family identification, and fine-grained human-vs-machine classification), the benchmark enables comprehensive assessment of both neural and traditional detectors. Experimental results reveal substantial performance degradation of current methods in complex scenarios, underscoring the need for AICD Bench as a unified and challenging evaluation platform.
📝 Abstract
Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce *AICD Bench*, the most comprehensive benchmark for AI-generated code detection. It spans *2M examples*, *77 models* across *11 families*, and *9 programming languages*, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (*i*) *Robust Binary Classification* under distribution shifts in language and domain, (*ii*) *Model Family Attribution*, grouping generators by architectural lineage, and (*iii*) *Fine-Grained Human-Machine Classification* across human, machine, hybrid, and adversarial code. Extensive evaluation of neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a *unified, challenging evaluation suite* to drive the next generation of robust approaches for AI-generated code detection. The data and code are available at https://huggingface.co/AICD-bench.