🤖 AI Summary
This work addresses the lack of systematic benchmarks for evaluating hybrid errors that arise when human programmers and large language models (LLMs) collaborate in software development, noting that the error patterns introduced by LLMs differ significantly from those made by humans. To bridge this gap, we propose Tricky², the first multilingual program defect dataset encompassing human-written, LLM-generated, and hybrid errors across C++, Python, and Java. Leveraging a taxonomy-guided prompting strategy and structure-preserving error injection, we enable controlled synthesis of multi-source errors. We conduct baseline experiments on error classification, localization, and repair to validate the dataset's utility, establishing a resource and evaluation foundation for studying code reliability in human–LLM collaborative programming.
📝 Abstract
Large language models (LLMs) are increasingly integrated into software development workflows, yet they often introduce subtle logic or data-misuse errors that differ from human bugs. To study how these two error types interact, we construct Tricky², a hybrid dataset that augments the existing TrickyBugs corpus of human-written defects with errors injected by both GPT-5 and OpenAI-oss-20b across C++, Python, and Java programs. Our approach uses a taxonomy-guided prompting framework to generate machine-originated bugs while preserving the original human defects and program structure. The resulting corpus spans human-only, LLM-only, and human+LLM splits, enabling analysis of mixed-origin error behavior, multi-bug repair robustness, and reliability in hybrid human–machine code. This paper outlines the dataset construction pipeline and illustrates its use through small-scale baseline evaluations on error classification, localization, and repair tasks.
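The taxonomy-guided injection step could be sketched as a prompt builder that pairs a bug category with a structure-preserving instruction. This is a minimal illustrative sketch only: the taxonomy categories, function name, and prompt wording below are assumptions for exposition, not the paper's actual taxonomy or pipeline.

```python
# Hypothetical sketch of taxonomy-guided error injection prompting.
# BUG_TAXONOMY and the prompt text are illustrative assumptions,
# not the taxonomy used by Tricky².
BUG_TAXONOMY = {
    "off_by_one": "shift a loop bound or array index by one",
    "data_misuse": "swap two variables of the same type or misuse a field",
    "wrong_operator": "replace one comparison or arithmetic operator",
}

def build_injection_prompt(code: str, category: str, language: str) -> str:
    """Ask an LLM to inject exactly one bug of the given taxonomy category
    while preserving program structure and any pre-existing human defects."""
    if category not in BUG_TAXONOMY:
        raise ValueError(f"unknown taxonomy category: {category}")
    return (
        f"You are given a {language} program. Inject exactly one bug: "
        f"{BUG_TAXONOMY[category]}. Do not rename identifiers, reorder "
        "functions, or repair any existing defects; keep the diff minimal "
        "so the original structure is preserved.\n\n"
        f"Program:\n{code}"
    )

# Example: request an off-by-one bug in a small Python snippet.
prompt = build_injection_prompt(
    "for i in range(n):\n    total += a[i]", "off_by_one", "python"
)
```

The "do not repair existing defects" constraint is what allows human-written and LLM-injected bugs to coexist in the hybrid (human+LLM) split rather than the model silently fixing the original TrickyBugs defect.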