Taxonomy Tree Generation from Citation Graph

📅 2024-10-02
📈 Citations: 2
Influential: 0
🤖 AI Summary
Problem: Manually constructing disciplinary taxonomy trees for scientific knowledge organization is labor-intensive, prone to human bias, and often overlooks low-citation yet high-impact publications. Method: We propose HiGTL, an end-to-end framework that jointly optimizes structural consistency and semantic coherence by integrating hierarchical graph clustering with LLM-driven, layer-wise concept concretization. Its components are hierarchical graph clustering, LLM-based node concept generation, multi-objective joint fine-tuning, and text-citation dual-modality representation learning. Results: Experiments demonstrate that HiGTL-generated taxonomies significantly outperform state-of-the-art methods in hierarchical plausibility, concept accuracy, and cross-layer consistency. The framework enables interpretable, interactive, and human-guided taxonomy construction, supporting systematic literature reviews and emerging trend identification.

📝 Abstract
Constructing taxonomies from citation graphs is essential for organizing scientific knowledge, facilitating literature reviews, and identifying emerging research trends. However, manual taxonomy construction is labor-intensive, time-consuming, and prone to human biases, often overlooking pivotal but less-cited papers. In this paper, to enable automatic hierarchical taxonomy generation from citation graphs, we propose HiGTL (Hierarchical Graph Taxonomy Learning), a novel end-to-end framework guided by human-provided instructions or preferred topics. Specifically, we propose a hierarchical citation graph clustering method that recursively groups related papers based on both textual content and citation structure, ensuring semantically meaningful and structurally coherent clusters. Additionally, we develop a novel taxonomy node verbalization strategy that iteratively generates central concepts for each cluster, leveraging a pre-trained large language model (LLM) to maintain semantic consistency across hierarchical levels. To further enhance performance, we design a joint optimization framework that fine-tunes both the clustering and concept generation modules, aligning structural accuracy with the quality of generated taxonomies. Extensive experiments demonstrate that HiGTL effectively produces coherent, high-quality taxonomies.
Problem

Research questions and friction points this paper is trying to address.

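Automating taxonomy generation from citation graphs, since manual construction is labor-intensive and biased
Recursively clustering papers using both textual content and citation structure
Generating coherent taxonomies whose verbalized concepts stay consistent across hierarchy levels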
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical citation graph clustering
Taxonomy node verbalization strategy
Joint optimization framework that fine-tunes both the clustering and concept generation modules
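
To make the first two ideas above concrete, here is a minimal sketch of recursive clustering over a combined text-and-citation affinity, followed by top-down labeling of each cluster. It is not HiGTL's implementation: the toy corpus, the `alpha` weight, the per-level thresholds, the connected-components split, and the word-frequency labeler (a stand-in for the LLM verbalization step) are all assumptions chosen for illustration, and the joint fine-tuning step is omitted entirely.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus (paper id -> title words); not data from the paper.
PAPERS = {
    "p1": "graph neural network node classification",
    "p2": "graph neural network link prediction",
    "p3": "graph attention network node embedding",
    "p4": "language model pretraining text",
    "p5": "language model instruction tuning",
    "p6": "language model text summarization",
}
# Hypothetical citation edges, treated as undirected in this sketch.
CITATIONS = {("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5"), ("p5", "p6")}


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def affinity(u, v, alpha=0.5):
    """Blend text similarity and citation adjacency (alpha is an assumed weight)."""
    text = cosine(Counter(PAPERS[u].split()), Counter(PAPERS[v].split()))
    cited = 1.0 if (u, v) in CITATIONS or (v, u) in CITATIONS else 0.0
    return alpha * text + (1 - alpha) * cited


def components(nodes, threshold):
    """Connected components after dropping pairs with affinity below threshold."""
    remaining, groups = set(nodes), []
    while remaining:
        start = min(remaining)  # deterministic traversal order
        remaining.discard(start)
        stack, group = [start], {start}
        while stack:
            u = stack.pop()
            linked = {v for v in remaining if affinity(u, v) >= threshold}
            remaining -= linked
            group |= linked
            stack.extend(linked)
        groups.append(sorted(group))
    return groups


def label(nodes, used):
    """Stand-in for LLM verbalization: the two most frequent words not
    already used by an ancestor, so children refine the parent concept."""
    counts = Counter(w for n in nodes for w in PAPERS[n].split() if w not in used)
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
    return " ".join(w for w, _ in top)


def build(nodes, name, used=frozenset(), thresholds=(0.6, 0.72), depth=0):
    """Recursively split a cluster with a stricter threshold at each level."""
    node = {"label": name, "papers": sorted(nodes), "children": []}
    if depth < len(thresholds):
        groups = components(nodes, thresholds[depth])
        if len(groups) > 1:  # a single group means no further structure here
            for g in groups:
                child_name = label(g, used)
                child_used = used | set(child_name.split())
                node["children"].append(
                    build(g, child_name, child_used, thresholds, depth + 1)
                )
    return node


def show(node, indent=0):
    """Print the taxonomy as an indented outline."""
    print(" " * indent + f"{node['label']} {node['papers']}")
    for child in node["children"]:
        show(child, indent + 2)


# The root label plays the role of a human-provided topic.
tree = build(sorted(PAPERS), "machine learning")
show(tree)
```

In this sketch the root name is supplied by the caller, mirroring the human-provided topic mentioned in the abstract, and each child label excludes words already used higher up as a crude proxy for cross-level consistency. A real system would replace `label` with an LLM call that receives the parent concept as context, and would likely use a proper graph clustering method rather than thresholded connected components.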