🤖 AI Summary
This study presents the first systematic evaluation of large language models' (LLMs') capability to perform automated qualitative coding in cybersecurity contexts, with the goal of reducing the cost of expert human annotation. Four state-of-the-art LLMs, selected from the top of the LiveBench leaderboard, were prompted with realistic annotation strategies—including detailed coding manuals, exemplar guidance, and conflicting examples—to code free-text participant comments on vulnerable code according to security-relevant categories. Agreement between model-generated and human annotations was assessed using Cohen's Kappa. Results indicate that LLM performance improves only when detailed code descriptions are provided, and even then the gains are inconsistent across codes, falling short of reliably substituting for human coders. These findings highlight current limitations in LLMs' ability to comprehend the nuanced technical context inherent in security-related qualitative analysis.
📝 Abstract
[Background:] Thematic analysis of free-text justifications in human experiments provides significant qualitative insights. Yet, it is costly because reliable annotations require multiple domain experts. Large language models (LLMs) seem ideal candidates to replace human annotators. [Problem:] Coding security-specific aspects (code identifiers mentioned, lines of code mentioned, security keywords mentioned) may require deeper contextual understanding than sentiment classification. [Objective:] Explore whether LLMs can act as automated annotators of technical security comments written by human subjects. [Method:] We prompt four LLMs ranked at the top of LiveBench to detect nine security-relevant codes in free-text comments by human subjects analyzing vulnerable code snippets. Outputs are compared to human annotations using Cohen's Kappa (chance-corrected agreement). We test different prompts mimicking annotation best practices, including emerging codes, detailed codebooks with examples, and conflicting examples. [Negative Results:] We observed marked improvements only when using detailed code descriptions; however, these improvements are not uniform across codes and are insufficient to reliably replace a human annotator. [Limitations:] Additional studies with more LLMs and annotation tasks are needed.
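The abstract's agreement metric, Cohen's Kappa, corrects raw agreement for the agreement two annotators would reach by chance given their label frequencies. The sketch below computes it from scratch for one binary code (present/absent per comment); the label vectors are hypothetical, not data from the study.

```python
from collections import Counter

def cohens_kappa(human, model):
    """Chance-corrected agreement between two annotators' labels for one code."""
    assert len(human) == len(model) and len(human) > 0
    n = len(human)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(h == m for h, m in zip(human, model)) / n
    # Expected chance agreement, from each annotator's marginal label frequencies.
    h_counts, m_counts = Counter(human), Counter(model)
    p_e = sum(h_counts[c] * m_counts[c] for c in set(human) | set(model)) / n**2
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators always assign the same label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations: 1 = security code present in the comment, 0 = absent.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model_labels = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(round(cohens_kappa(human_labels, model_labels), 3))  # → 0.6
```

Here raw agreement is 0.8, but with balanced labels chance agreement is 0.5, so Kappa drops to 0.6—illustrating why the paper reports the chance-corrected score rather than plain accuracy.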