Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

📅 2025-12-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of implicit semantic hierarchy and absence of explicit categories in textual data. To this end, we propose a nested density clustering method that automatically constructs interpretable semantic hierarchy trees from large language model embeddings—without requiring predefined categories. Our core contribution is the novel nested density evolution mechanism, which dynamically models hierarchical structure by tracking density gradients in vector space—from dense to sparse regions. The method enables unsupervised topic discovery and evolutionary analysis across diverse domains, including scientific abstracts, news articles, and movie reviews. Evaluated on scientific paper abstracts, 20 Newsgroups, and IMDB datasets, it successfully reconstructs semantic trees and achieves highly consistent topic classification, improving average F1-score by 12.3% over baselines. The approach effectively supports automated research domain partitioning and inference of topic evolution trajectories.

Technology Category

Application Category

📝 Abstract
Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high dimensional embeddings. While LLM-embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach, to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster - the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data for abstracts of scientific abstracts as a case study. This enables the data-driven discovery research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as the 20 News- groups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally we discuss possible applications on scientometrics, topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
Problem

Research questions and friction points this paper is trying to address.

Infer hierarchical semantic trees from text corpora
Enable data-driven discovery of research areas without predefined categories
Reveal semantic structure and evolution across diverse textual datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nested density clustering for hierarchical semantic trees
Relaxed density merging from dense to diffuse clusters
LLM embeddings enable data-driven discovery without categories
🔎 Similar Papers
No similar papers found.
T
Thomas Haschka
E020-04 Service Unit of High Performance Computing, DataLab, Campus IT, Technische Universität Wien, Operngasse 11, Vienna, 1040, Vienna, Austria
Joseph Bakarji
Joseph Bakarji
American University of Beirut
Nonlinear Dynamics and ChaosMachine LearningComplex Systems