Can LLMs Replace Manual Annotation of Software Engineering Artifacts?

📅 2024-08-10
🏛️ arXiv.org
📈 Citations: 3 · Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably replace human annotators, mitigating the high cost and logistical difficulty of human-subject studies in evaluations of software engineering innovations. It systematically evaluates six state-of-the-art LLMs on ten code-related annotation tasks, such as judging the quality of a code summary or whether a change repairs a defect, drawn from five public datasets. Methodologically, the paper proposes *model-model agreement* as a predictor of whether a task is suitable for LLM annotation at all, and confidence-threshold filtering to identify individual samples that are safe for LLM-only annotation, together establishing a hybrid human–LLM evaluation paradigm. Results show that LLMs reach or approach human inter-annotator agreement (Krippendorff's α ≥ 0.8) on several tasks, that model-model agreement strongly predicts task feasibility (AUC = 0.92), and that confidence-based filtering raises replacement accuracy to 94.3%.
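
As a concrete reference for the agreement statistic used throughout, below is a minimal Python sketch of Krippendorff's α for nominal labels; the same statistic applies to human–human, human–model, and model–model rater pairs. The function name, data layout, and example data are illustrative assumptions, not the paper's actual code.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of annotation units; each unit is the list of labels
           assigned by the raters who annotated it (missing ratings
           are simply omitted from that unit's list).
    """
    # Coincidence matrix: weighted counts of ordered label pairs that
    # co-occur within the same unit (weight 1/(m-1) for m raters).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single rating contributes no pairs
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal label mass n_c and total pairable mass n.
    n_c = Counter()
    for (a, _), w in coincidences.items():
        n_c[a] += w
    n = sum(n_c.values())

    # Observed vs. expected disagreement (nominal delta: 1 iff labels differ).
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Example: two raters agree on 3 of 4 binary judgments (alpha ≈ 0.53).
print(krippendorff_alpha_nominal([["yes", "yes"], ["no", "no"],
                                  ["yes", "no"], ["yes", "yes"]]))
```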

📝 Abstract
Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.
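
To make the proposed mixed human–LLM workflow concrete, here is a minimal sketch of the decision logic implied by the abstract: first gate an entire task on model-model agreement, then route individual samples to the LLM only when its confidence clears a threshold, falling back to human annotators otherwise. The thresholds, class names, and the `annotate`/`human_annotate` interfaces are hypothetical placeholders, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class LLMRater:
    """Wraps one model; annotate returns (label, confidence in [0, 1])."""
    name: str
    annotate: Callable[[str], Tuple[str, float]]  # hypothetical interface

def pairwise_model_agreement(models: List[LLMRater], samples: List[str]) -> float:
    """Fraction of samples on which all models emit the same label
    (a simple stand-in for the model-model agreement statistic)."""
    agree = 0
    for s in samples:
        labels = {m.annotate(s)[0] for m in models}
        agree += len(labels) == 1
    return agree / len(samples)

def hybrid_annotate(models, samples, human_annotate,
                    task_gate=0.8, conf_gate=0.9):
    """Route each sample to the cheapest annotator that is still safe."""
    # Step 1: task-level gate -- if the models disagree with each other,
    # treat the whole task as unsuitable for LLM annotation.
    if pairwise_model_agreement(models, samples) < task_gate:
        return [human_annotate(s) for s in samples]

    # Step 2: sample-level gate -- keep the LLM label only when the
    # primary model is confident; otherwise fall back to a human.
    primary = models[0]
    results = []
    for s in samples:
        label, conf = primary.annotate(s)
        results.append(label if conf >= conf_gate else human_annotate(s))
    return results
```

In a faithful reproduction, the task-level gate would use a chance-corrected statistic such as Krippendorff's α (as sketched above) rather than raw exact-match agreement, mirroring the α ≥ 0.8 convention for acceptable inter-rater agreement.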
Problem

Research questions and friction points this paper is trying to address.

Can LLMs reliably replace human annotators of code and code-related artifacts?
Human-subject studies are costly, and suitable subjects (professional programmers of varying experience) are hard to recruit.
How to tell, per task and per sample, when LLM annotation can safely substitute for human judgment?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Substitutes costly human subjects with cheaper LLM queries across ten annotation tasks
Model-model agreement as a predictor of whether a task is suitable for LLMs at all
Model confidence as a selector of individual samples where LLMs can safely replace humans