TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision

📅 2024-02-29

🏛️ arXiv.org

📈 Citations: 6

✨ Influential: 0

🤖 AI Summary

This work addresses hierarchical text classification under minimal supervision: the only inputs are the class names in the taxonomy (no labeled examples) and an unlabeled corpus. It targets two weaknesses of zero-shot LLM prompting on hierarchical label spaces, namely poor accuracy and high inference overhead. The method first enriches the label taxonomy automatically, combining LLM prior knowledge with class-indicative features mined from the corpus without supervision. It then applies hierarchy-aware LLM-based data annotation and generation that explicitly model the label tree structure. Evaluated on multiple benchmarks, the approach significantly outperforms existing weakly supervised methods and reaches accuracy comparable to zero-shot LLM inference while reducing inference cost by over one order of magnitude. Its core contribution is a label-name-driven, structure-aware framework for weakly supervised hierarchical classification that removes the reliance on labeled data while preserving hierarchical semantics and scalability.

📝 Abstract

Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy, which is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully or semi-supervised methods that require a large amount of human-annotated data, which is costly and time-consuming to acquire. To reduce human effort, in this paper, we work on hierarchical text classification with a minimal amount of supervision: using the class name of each node as the only supervision. Recently, large language models (LLMs) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which combines the general knowledge of LLMs and task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding and utilizes novel LLM-based data annotation and generation methods specifically tailored for the hierarchical setting. Experiments show that TELEClass can significantly outperform previous baselines while achieving comparable performance to zero-shot prompting of LLMs with drastically less inference cost.
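
To make the idea of "enriching a raw taxonomy with class-indicative features" more concrete, the following is a minimal toy sketch and not the TELEClass implementation. It assumes a tiny hypothetical taxonomy and corpus, replaces the paper's LLM-based and corpus-based enrichment with simple co-occurrence counting, and uses made-up helper names (`enrich`, `subtree_terms`, `classify`). It only illustrates the general pattern: start from class names alone, mine extra terms per class from unlabeled text, then walk the label tree top-down.

```python
from collections import Counter

# Hypothetical toy taxonomy (parent -> children); class names are the only supervision.
TAXONOMY = {
    "root": ["sports", "science"],
    "sports": ["soccer", "tennis"],
    "science": ["physics", "biology"],
}

# Hypothetical unlabeled corpus.
CORPUS = [
    "soccer players score a goal on the pitch",
    "a soccer goal decided the match",
    "tennis players serve with a racket",
    "a tennis racket and a serve",
    "physics studies energy and quantum particles",
    "quantum physics and energy",
    "biology studies cells and genes",
    "genes and cells in biology",
    "sports fans watch every match",
    "science relies on experiments",
]

STOPWORDS = {"a", "an", "and", "the", "on", "in", "with", "of", "every"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {tok for tok in text.lower().split() if tok not in STOPWORDS}


def enrich(taxonomy, corpus, top_k=3):
    """Attach to each class the terms that co-occur with its name more often
    than they appear in documents mentioning a sibling class name.
    This is a crude stand-in for the enrichment described in the abstract."""
    docs = [tokenize(d) for d in corpus]
    terms = {}
    for children in taxonomy.values():
        for child in children:
            own, sib = Counter(), Counter()
            for doc in docs:
                if child in doc:
                    own.update(doc - {child})
                if any(s in doc for s in children if s != child):
                    sib.update(doc)
            scored = {t: c - sib[t] for t, c in own.items()}
            best = sorted(scored, key=lambda t: (-scored[t], t))[:top_k]
            terms[child] = {child} | {t for t in best if scored[t] > 0}
    return terms


def subtree_terms(node, taxonomy, terms):
    """Union of the enriched terms of a node and all of its descendants."""
    out = set(terms.get(node, ()))
    for child in taxonomy.get(node, []):
        out |= subtree_terms(child, taxonomy, terms)
    return out


def classify(doc, taxonomy, terms, node="root"):
    """Top-down walk: at each level pick the child whose subtree terms
    overlap most with the document; stop when nothing matches."""
    tokens, path = tokenize(doc), []
    while node in taxonomy:
        scores = {c: len(tokens & subtree_terms(c, taxonomy, terms))
                  for c in taxonomy[node]}
        best = max(taxonomy[node], key=lambda c: scores[c])
        if scores[best] == 0:
            break
        path.append(best)
        node = best
    return path


ENRICHED = enrich(TAXONOMY, CORPUS)
print(classify("a late goal decided the match", TAXONOMY, ENRICHED))
```

Note that the example document never mentions "soccer" or "sports" by name; it is routed through the tree only because of the mined terms, which is the intuition behind using corpus-derived features in addition to the raw taxonomy skeleton. A realistic system would need far more robust term scoring than this counting heuristic.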

Problem

Research questions and friction points this paper is trying to address.

Minimizes supervision in hierarchical text classification

Enhances taxonomy with LLM and unlabeled corpus

Reduces inference cost compared to zero-shot LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-enhanced hierarchical classification

Taxonomy enrichment with class features

Minimal supervision using class names
