🤖 AI Summary
To address the challenges of large label spaces, severe long-tail distributions, and hierarchical semantic modeling in zero-shot hierarchical text classification (HTC), this paper proposes the first retrieval-augmented generation (RAG) framework that integrates knowledge graphs (KGs) with large language models (LLMs). The method dynamically retrieves label subgraphs and jointly embeds hierarchical paths to explicitly model multi-level semantic structure; it further leverages hierarchical prompt engineering and graph-enhanced LLM inference to overcome the hierarchical generalization bottlenecks of conventional zero-shot approaches. Evaluated under a strict zero-shot setting on WoS, DBpedia, and Amazon, the framework achieves a 23.6% absolute F1 improvement on deep-level categories. This substantially alleviates the difficulties posed by high-dimensional label spaces and long-tail distributions, establishing a scalable, semantics-driven paradigm for low-resource HTC.
📝 Abstract
Hierarchical Text Classification (HTC) involves assigning documents to labels organized in a taxonomy. Most prior research on HTC has focused on supervised methods; in real-world scenarios, however, supervised HTC can be impractical due to a lack of annotated data. HTC also commonly suffers from large label spaces and long-tailed label distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which addresses these challenges by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method uses a Retrieval-Augmented Generation (RAG) approach to retrieve subgraphs of the knowledge graph that are relevant to the input text, enhancing the LLM's understanding of label semantics at each level of the hierarchy. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, with particularly large gains at deeper levels of the hierarchy. This demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges of large label spaces and long-tailed label distributions. Our code is available at: https://github.com/QianboZang/KG-HTC.
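The pipeline described above (retrieve a relevant label subgraph, serialize its hierarchy paths into a prompt, then query an LLM) can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the taxonomy, label names, and function names are invented for illustration, and a bag-of-words cosine similarity stands in for the dense retriever and vector store a real RAG setup would use.

```python
from collections import Counter
from math import sqrt

# Toy label taxonomy as parent -> children edges (hypothetical, not a real dataset).
TAXONOMY = {
    "Computer Science": ["Machine Learning", "Databases"],
    "Medicine": ["Cardiology", "Oncology"],
}

def embed(text):
    """Stand-in embedding: a bag-of-words term-count vector (assumption;
    a real system would use a neural text encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_subgraph(text, top_k=1):
    """Rank top-level labels by similarity to the input text and keep the
    subtrees of the top_k best matches as the retrieved label subgraph."""
    doc = embed(text)
    scored = sorted(
        TAXONOMY,
        key=lambda lbl: cosine(doc, embed(lbl + " " + " ".join(TAXONOMY[lbl]))),
        reverse=True,
    )
    return {lbl: TAXONOMY[lbl] for lbl in scored[:top_k]}

def build_prompt(text, subgraph):
    """Serialize the retrieved hierarchy paths into a zero-shot prompt
    that constrains the LLM to the retrieved candidate paths."""
    paths = [f"{parent} -> {child}"
             for parent, children in subgraph.items()
             for child in children]
    return ("Classify the document into exactly one taxonomy path.\n"
            "Candidate paths:\n" + "\n".join(paths) +
            f"\nDocument: {text}\nPath:")

doc = "training neural networks with machine learning"
subgraph = retrieve_subgraph(doc)
prompt = build_prompt(doc, subgraph)  # would be sent to the LLM
```

In this sketch the retrieval step prunes the label space before the LLM is ever called, which is how a RAG approach keeps the prompt tractable even when the full taxonomy has thousands of leaves.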