🤖 AI Summary
Existing fine-grained corpus analysis relies either on labor-intensive manual annotation or on opaque statistical methods, sacrificing either scalability or interpretability. This paper proposes a two-stage inductive coding framework powered by large language models (LLMs): (1) bottom-up prompting for automated, fine-grained label generation; and (2) semantic embedding–guided hierarchical clustering to construct interpretable, multi-level topic structures. Grounded in the logic of qualitative research, the approach preserves analytical transparency and human control while substantially improving the scalability of large-scale qualitative text analysis. Experiments across three heterogeneous datasets show high alignment with expert annotations (average F1 = 0.82). Applied to opioid litigation texts, the method surfaced systemic, aggressive marketing strategies employed by pharmaceutical companies, confirming both its interpretive value and its practical utility.
📝 Abstract
Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this end, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode's potential for facilitating nuanced analyses of large-scale data.