Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Qualitative software engineering research currently lacks reproducible methods for integrating large language models (LLMs) and systematic frameworks for evaluating the quality of their outputs, particularly in thematic analysis (TA). Method: We propose the first LLM-oriented prompt engineering and evaluation framework for TA, grounded in Braun and Clarke's reflexive TA approach. It includes reproducible prompt templates and a blinded evaluation rubric aligned with established trustworthiness criteria (e.g., Lincoln & Guba). We conduct an empirical evaluation across three models (GPT-4, Claude, Llama3) on 15 interview transcripts about software engineers' well-being. Contribution/Results: In blind expert assessment, evaluators preferred LLM-generated codes over human-coded ones 61% of the time; however, critical limitations are identified, including unnecessary data fragmentation and omission of latent meaning. The study establishes, for the first time, empirically informed boundaries for human-AI collaboration in qualitative analysis, providing both a methodological foundation and practical guidelines for LLM-augmented qualitative research.
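
The paper's prompt templates are described rather than reproduced in this summary. As a rough illustration of the kind of Phase 2 (initial coding) workflow described above, the sketch below sends one interview excerpt to an LLM and asks for candidate codes. The prompt wording, model name, and output format here are assumptions for illustration, not the authors' published templates.

```python
# Illustrative sketch only: the coding prompt, model choice, and JSON output
# format are assumptions, not the prompt templates published in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PHASE2_PROMPT = (
    "You are assisting with Phase 2 of Braun and Clarke's reflexive thematic "
    "analysis (generating initial codes). Read the interview excerpt and "
    "return a JSON list of objects with the fields 'code' (a short label), "
    "'quote' (the supporting excerpt text), and 'rationale'. Research "
    "question: what affects software engineers' well-being?"
)

def code_excerpt(excerpt: str, model: str = "gpt-4") -> str:
    """Ask the model for candidate initial codes for one transcript excerpt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # favour reproducibility over creativity
        messages=[
            {"role": "system", "content": PHASE2_PROMPT},
            {"role": "user", "content": excerpt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(code_excerpt("I stopped taking breaks once the release deadline slipped..."))
```

In practice the same template would be run per excerpt and per model (GPT-4, Claude, Llama3 in the study), with the resulting codes pooled for the blinded comparison against human coding.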

📝 Abstract
As artificial intelligence advances, large language models (LLMs) are entering qualitative research workflows, yet no reproducible methods exist for integrating them into established approaches like thematic analysis (TA), one of the most common qualitative methods in software engineering research. Moreover, existing studies lack systematic evaluation of LLM-generated qualitative outputs against established quality criteria. We designed and iteratively refined prompts for Phases 2-5 of Braun and Clarke's reflexive TA, then tested outputs from multiple LLMs against codes and themes produced by experienced researchers. Using 15 interviews on software engineers' well-being, we conducted blind evaluations with four expert evaluators who applied rubrics derived directly from Braun and Clarke's quality criteria. Evaluators preferred LLM-generated codes 61% of the time, finding them analytically useful for answering the research question. However, evaluators also identified limitations: LLMs fragmented data unnecessarily, missed latent interpretations, and sometimes produced themes with unclear boundaries. Our contributions are threefold. First, a reproducible approach integrating refined, documented prompts with an evaluation framework to operationalize Braun and Clarke's reflexive TA. Second, an empirical comparison of LLM- and human-generated codes and themes in software engineering data. Third, guidelines for integrating LLMs into qualitative analysis while preserving methodological rigour, clarifying when and how LLMs can assist effectively and when human interpretation remains essential.
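
As a back-of-the-envelope check on how a figure like the 61% preference rate can be derived from blinded pairwise judgments, the snippet below aggregates hypothetical evaluator verdicts; the labels and counts are illustrative and constructed to match the reported rate, not the study's actual evaluation data.

```python
# Hypothetical aggregation of blinded pairwise judgments; verdict labels and
# counts are illustrative, not the study's actual evaluation data.
from collections import Counter

# Each entry records which source the blinded evaluator preferred for one code pair.
verdicts = ["llm"] * 61 + ["human"] * 34 + ["tie"] * 5

counts = Counter(verdicts)
total = sum(counts.values())
for source in ("llm", "human", "tie"):
    share = counts[source] / total
    print(f"{source:>5}: {counts[source]:3d} ({share:.0%})")
# llm preference comes out at 61% here by construction.
```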
Problem

Research questions and friction points this paper is trying to address.

Developing reproducible methods for integrating LLMs into thematic analysis workflows
Systematically evaluating LLM-generated qualitative outputs against established quality criteria
Providing guidelines for using LLMs while preserving methodological rigor
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iteratively refined prompts for thematic analysis phases
Evaluated LLM outputs using expert rubrics and criteria
Developed guidelines for integrating LLMs while maintaining rigor