Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low inter-coder reliability and high time cost plague qualitative thematic analysis. This paper proposes a multi-LLM collaborative thematic analysis framework, introducing the first dual-dimension reliability validation paradigm—combining coding consistency (Cohen's kappa) and semantic consistency (embedding cosine similarity). The framework integrates Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Sonnet, supports parameterized configuration of seeds, temperature, and prompt templates, and performs structure-agnostic consensus theme extraction from arbitrary JSON outputs. An ensemble consensus algorithm based on multiple independent reasoning rounds ensures reproducibility, and the configurable analysis pipeline is open-sourced. Evaluated on psychedelic art therapy interview data, the framework achieves mean kappa > 0.80 (max 0.907) and semantic similarity > 92% across models, extracting 6, 5, and 4 highly consistent themes per model (50–83% cross-round coverage) and significantly outperforming single-run LLM analysis. This work establishes a methodological benchmark for AI-augmented qualitative research.

📝 Abstract
Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield only moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's kappa ($\kappa$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof of concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate that Gemini achieves the highest reliability ($\kappa = 0.907$, cosine = 95.3%), followed by GPT-4o ($\kappa = 0.853$, cosine = 92.6%) and Claude ($\kappa = 0.842$, cosine = 92.1%). All three models achieve high agreement ($\kappa > 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5, and Claude 4. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
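The two reliability metrics in the abstract are standard and easy to sketch. Below is a minimal, dependency-free illustration (not the paper's implementation): Cohen's kappa corrects observed agreement between two coders for chance agreement, and cosine similarity compares two theme embeddings. The label and vector values are invented toy data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two coders' categorical labels.
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items coded identically.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(x * x for x in v) ** 0.5
    return dot / (norm_u * norm_v)

# Toy example: two coders assign themes to four interview segments.
coder_1 = ["healing", "identity", "healing", "community"]
coder_2 = ["healing", "identity", "community", "community"]
print(cohens_kappa(coder_1, coder_2))          # agreement above chance
print(cosine_similarity([0.2, 0.9], [0.3, 0.8]))  # near 1.0 for similar vectors
```

In the paper's framework these metrics are computed pairwise across a model's independent runs rather than between human coders, which is what lets a single LLM's self-consistency be quantified.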
Problem

Research questions and friction points this paper is trying to address.

Addresses reliability challenges in qualitative research using LLMs
Combines Cohen's Kappa and semantic similarity for validation
Enables configurable multi-run thematic analysis with consensus extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-LLM ensemble validation with dual reliability metrics
Configurable analysis parameters and custom prompt structures
Structure-agnostic consensus theme extraction from JSON outputs
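One way to read "structure-agnostic consensus extraction from JSON outputs" is: walk whatever JSON shape each run returns, harvest candidate theme strings, and keep those that recur in enough runs. The sketch below is an assumption about the mechanism, not the paper's code; in particular, it naively treats every string value as a candidate theme and uses a hypothetical `min_coverage` threshold matching the 50-83% coverage figures reported.

```python
import json

def collect_strings(node, out):
    """Recursively gather string values from any JSON structure,
    regardless of key names or nesting depth."""
    if isinstance(node, dict):
        for value in node.values():
            collect_strings(value, out)
    elif isinstance(node, list):
        for item in node:
            collect_strings(item, out)
    elif isinstance(node, str):
        out.add(node.strip().lower())

def consensus_themes(run_outputs, min_coverage=0.5):
    """Return {theme: coverage} for themes appearing in at least
    `min_coverage` of the independent runs."""
    per_run = []
    for raw in run_outputs:
        themes = set()
        collect_strings(json.loads(raw), themes)
        per_run.append(themes)
    n = len(per_run)
    coverage = {t: sum(t in run for run in per_run) / n
                for t in set().union(*per_run)}
    return {t: c for t, c in coverage.items() if c >= min_coverage}

# Toy example: two runs return themes under different JSON structures.
runs = ['{"themes": ["Healing", "Identity"]}',
        '{"results": [{"theme": "healing"}]}']
print(consensus_themes(runs))  # "healing" survives across both shapes
```

A real pipeline would also need to merge near-duplicate theme labels (e.g. via the same embedding similarity used for the semantic metric) rather than relying on exact string matches.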