🤖 AI Summary
Low inter-coder reliability and high time cost plague qualitative thematic analysis. This paper proposes a multi-LLM collaborative thematic analysis framework, introducing the first dual-dimension reliability validation paradigm—combining coding consistency (Cohen's Kappa) and semantic consistency (embedding cosine similarity). The framework integrates Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Sonnet, supporting parameterized configuration of seeds, temperature, and prompt templates to enable structure-agnostic consensus theme extraction across JSON output formats. An ensemble consensus algorithm based on multiple independent reasoning rounds ensures reproducibility, and the configurable analysis pipeline is open-sourced. Evaluated on psychedelic art therapy interview data, the framework achieves mean Kappa >0.80 (max 0.907) and semantic similarity >92% across models, extracting 6, 5, and 4 highly consistent themes per model (Gemini: 50–83% cross-round coverage), significantly outperforming single-run LLM analysis. This work establishes a methodological benchmark for AI-augmented qualitative research.
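The ensemble consensus step described above — keeping only themes that recur across independent reasoning rounds — can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the theme names are hypothetical, and the real pipeline presumably matches themes semantically (via embeddings) rather than by exact string equality.

```python
from collections import Counter

def consensus_themes(runs, min_coverage=0.5):
    """Keep themes that appear in at least `min_coverage` of the runs.

    `runs` is a list of theme-name sets, one per independent run.
    Returns {theme: fraction of runs containing it}.
    """
    counts = Counter(theme for run in runs for theme in set(run))
    n = len(runs)
    return {t: c / n for t, c in counts.items() if c / n >= min_coverage}

# Hypothetical themes from four independent runs of one model
runs = [
    {"ego dissolution", "emotional release", "visual imagery"},
    {"ego dissolution", "emotional release", "anxiety"},
    {"ego dissolution", "visual imagery", "integration"},
    {"ego dissolution", "emotional release", "integration"},
]

themes = consensus_themes(runs)
# "ego dissolution" (4/4) and "emotional release" (3/4) survive;
# "anxiety" (1/4) falls below the 50% coverage threshold.
```

The coverage fraction attached to each surviving theme corresponds to the 50–83% cross-round consistency figures reported for the extracted consensus themes.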
📝 Abstract
Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield only moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa ($\kappa$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate that Gemini achieves the highest reliability ($\kappa = 0.907$, cosine=95.3%), followed by GPT-4o ($\kappa = 0.853$, cosine=92.6%) and Claude ($\kappa = 0.842$, cosine=92.1%). All three models achieve high agreement ($\kappa > 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude identifying 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
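The two reliability metrics above are standard and can be sketched compactly. This is an illustrative implementation under stated assumptions — the label sequences and vectors are invented, and the paper computes cosine similarity over model-generated embeddings rather than the toy vectors shown here.

```python
import math
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two coders'
    categorical labels over the same segments."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def cosine_similarity(u, v):
    """Cosine similarity between two (embedding) vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# Hypothetical codes assigned by two runs to eight transcript segments
run1 = ["A", "A", "B", "B", "C", "A", "B", "C"]
run2 = ["A", "A", "B", "B", "C", "A", "B", "B"]
kappa = cohens_kappa(run1, run2)   # ≈ 0.805, i.e. "high agreement"
sim = cosine_similarity([1.0, 2.0, 0.5], [0.9, 2.1, 0.4])
```

Pairing the two metrics is the point of the dual-dimension design: kappa checks whether runs assign the *same codes* to the same segments, while embedding cosine similarity checks whether differently worded themes still *mean* the same thing.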