🤖 AI Summary
This work addresses hierarchical text classification under extremely low supervision: only category names and an unlabeled corpus are available, with no labeled examples. It targets two weaknesses of zero-shot LLM prompting on hierarchical label structures, namely poor accuracy and high inference overhead. The method introduces an automatic label taxonomy enrichment mechanism that combines LLM prior knowledge with class-indicative features mined from the unlabeled corpus. It further proposes a hierarchy-aware LLM-based data annotation and generation framework that explicitly models the label tree structure and supports dynamic expansion. Evaluated on multiple benchmarks, the approach significantly outperforms existing weakly supervised methods, performs comparably to zero-shot LLM prompting, and reduces inference cost by over one order of magnitude. Its core contribution is the first fully label-name-driven, structure-aware hierarchical weak supervision framework, which eliminates reliance on labeled data while preserving hierarchical semantics and scalability.
📝 Abstract
Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy. It is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully supervised or semi-supervised methods that require a large amount of human-annotated data, which is costly and time-consuming to acquire. To reduce human effort, in this paper we work on hierarchical text classification with a minimal amount of supervision: using only the class name of each node as supervision. Recently, large language models (LLMs) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle these challenges, we propose TELEClass (Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification), which combines the general knowledge of LLMs with task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding and utilizes novel LLM-based data annotation and generation methods specifically tailored for the hierarchical setting. Experiments show that TELEClass significantly outperforms previous baselines while achieving performance comparable to zero-shot prompting of LLMs at drastically lower inference cost.