🤖 AI Summary
Problem: Manually constructing disciplinary taxonomy trees to organize scientific knowledge is labor-intensive, prone to human bias, and often overlooks low-citation yet high-impact publications. Method: We propose HiGTL, an end-to-end framework that jointly optimizes structural consistency and semantic coherence. It combines hierarchical graph clustering, representation learning over both text and citation modalities, LLM-driven layer-wise concept generation for each taxonomy node, and multi-objective joint fine-tuning. Results: Experiments show that HiGTL-generated taxonomies significantly outperform state-of-the-art methods in hierarchical plausibility, concept accuracy, and cross-layer consistency. The framework enables interpretable, interactive, human-guided taxonomy construction for systematic literature reviews and emerging-trend identification.
📝 Abstract
Constructing taxonomies from citation graphs is essential for organizing scientific knowledge, facilitating literature reviews, and identifying emerging research trends. However, manual taxonomy construction is labor-intensive, time-consuming, and prone to human biases, often overlooking pivotal but less-cited papers. In this paper, to enable automatic hierarchical taxonomy generation from citation graphs, we propose HiGTL (Hierarchical Graph Taxonomy Learning), a novel end-to-end framework guided by human-provided instructions or preferred topics. Specifically, we propose a hierarchical citation graph clustering method that recursively groups related papers based on both textual content and citation structure, ensuring semantically meaningful and structurally coherent clusters. Additionally, we develop a novel taxonomy node verbalization strategy that iteratively generates central concepts for each cluster, leveraging a pre-trained large language model (LLM) to maintain semantic consistency across hierarchical levels. To further enhance performance, we design a joint optimization framework that fine-tunes both the clustering and concept generation modules, aligning structural accuracy with the quality of generated taxonomies. Extensive experiments demonstrate that HiGTL effectively produces coherent, high-quality taxonomies.
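The recursive grouping described in the abstract can be illustrated with a minimal sketch. This is not HiGTL's implementation: the toy corpus, the `alpha` blend weight, the seed-based bisection, and the most-frequent-token labeler (a stand-in for LLM verbalization) are all assumptions made for illustration, and there is no joint fine-tuning here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: paper id -> (title tokens, ids of cited papers).
PAPERS = {
    "p1": ({"graph", "clustering", "community"}, {"p2"}),
    "p2": ({"graph", "neural", "network"}, set()),
    "p3": ({"graph", "embedding", "node"}, {"p1", "p2"}),
    "p4": ({"language", "model", "pretraining"}, {"p5"}),
    "p5": ({"language", "model", "prompting"}, set()),
    "p6": ({"language", "generation", "decoding"}, {"p4"}),
}


def similarity(a, b, alpha=0.5):
    """Blend text overlap (Jaccard) with a direct-citation indicator."""
    tokens_a, cites_a = PAPERS[a]
    tokens_b, cites_b = PAPERS[b]
    text = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    linked = 1.0 if (b in cites_a or a in cites_b) else 0.0
    return alpha * text + (1 - alpha) * linked


def split(ids):
    """Bisect a cluster: seed with the least similar pair, then assign the rest."""
    seed_a, seed_b = min(combinations(sorted(ids), 2),
                         key=lambda pair: similarity(*pair))
    left, right = [seed_a], [seed_b]
    for pid in sorted(ids):
        if pid in (seed_a, seed_b):
            continue
        closer_to_a = similarity(pid, seed_a) >= similarity(pid, seed_b)
        (left if closer_to_a else right).append(pid)
    return left, right


def label(ids):
    """Placeholder for LLM verbalization: most frequent token, ties broken alphabetically."""
    counts = Counter(tok for pid in ids for tok in PAPERS[pid][0])
    return min(counts, key=lambda tok: (-counts[tok], tok))


def build_taxonomy(ids, min_size=3):
    """Recursively split clusters larger than min_size and label every node."""
    node = {"concept": label(ids), "papers": sorted(ids), "children": []}
    if len(ids) > max(min_size, 1):
        for part in split(ids):
            node["children"].append(build_taxonomy(part, min_size))
    return node


taxonomy = build_taxonomy(list(PAPERS))
```

On this toy corpus the root splits into a graph-related cluster and a language-related cluster. In a real pipeline the similarity would come from learned text and citation embeddings, and the labeler would be replaced by a prompted LLM that sees the parent concept to keep levels consistent.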