AI Summary
This work addresses the challenge of efficiently constructing a comprehensive and well-structured taxonomy of artificial intelligence skills and tasks from massive hiring data. To this end, the authors propose TaxonomyBuilder, a framework that integrates systematic data filtering, clustering algorithms, and large language model-enhanced hierarchical label generation to automatically derive domain-specific taxonomies from curated, high-quality data subsets. Experimental results demonstrate that taxonomies built from filtered data exhibit significantly broader coverage and superior structural coherence compared to those generated from raw, unfiltered data using existing methods. The study thus establishes a novel paradigm for data-driven, automated taxonomy construction in specialized domains.
Abstract
Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.
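The filter, cluster, and label stages described above can be illustrated with a minimal sketch. This is not the authors' TaxonomyBuilder implementation: the keyword filter, the greedy Jaccard clustering, and the frequency-based labeler (standing in for an LLM call) are all simplifying assumptions, and every name below (`AI_KEYWORDS`, `filter_postings`, `cluster_by_overlap`, `label_cluster`, `build_taxonomy`) is hypothetical.

```python
from collections import Counter

# Hypothetical domain keywords used for filtering; not taken from the paper.
AI_KEYWORDS = frozenset({"learning", "neural", "nlp", "llm", "vision"})
STOPWORDS = frozenset({"and", "for", "with", "the", "a", "of", "to", "in"})


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


def filter_postings(postings, keywords=AI_KEYWORDS, min_hits=1):
    """Keep postings that mention at least `min_hits` distinct domain keywords."""
    return [p for p in postings if len(set(tokenize(p)) & keywords) >= min_hits]


def cluster_by_overlap(postings, threshold=0.2):
    """Greedy single-pass clustering: a posting joins the first cluster whose
    seed posting it resembles (Jaccard similarity >= threshold)."""
    clusters = []  # each entry: (seed token set, list of member postings)
    for p in postings:
        tokens = set(tokenize(p))
        for seed, members in clusters:
            union = tokens | seed
            jaccard = len(tokens & seed) / len(union) if union else 0.0
            if jaccard >= threshold:
                members.append(p)
                break
        else:
            clusters.append((tokens, [p]))
    return [members for _, members in clusters]


def label_cluster(members):
    """Stand-in for LLM labeling: the most frequent non-stopword token."""
    counts = Counter(
        t for m in members for t in tokenize(m) if t not in STOPWORDS
    )
    return counts.most_common(1)[0][0] if counts else "unlabeled"


def build_taxonomy(postings):
    """Filter first, then cluster, then label. Returns (label, members) pairs."""
    kept = filter_postings(postings)
    return [(label_cluster(c), c) for c in cluster_by_overlap(kept)]


if __name__ == "__main__":
    sample = [
        "Train deep learning models for computer vision",
        "Build computer vision models with deep learning",
        "Manage warehouse inventory and shipping",
        "Fine-tune LLM prompts for NLP pipelines",
    ]
    for label, members in build_taxonomy(sample):
        print(label, members)
```

Filtering before clustering is the design choice the abstract argues for: off-domain postings (here, the warehouse posting) never reach the clustering step, so they cannot form or distort clusters. A real pipeline would presumably replace the token-overlap clustering with embedding-based clustering and the frequency labeler with an LLM prompt.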