🤖 AI Summary
This work addresses the difficulty of building taxonomies from social media streams, where posts are short, noisy, semantically entangled, and temporally dynamic. It proposes EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered posts. EvoTaxo converts each post into a structured draft edit over the current taxonomy, accumulates evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure selects reliable edits before execution, and a concept memory bank at each node preserves semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment and better corpus coverage at comparable taxonomy size. A case study on the Reddit community /r/ICE_Raids shows that it captures meaningful temporal shifts in discourse.
📝 Abstract
Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
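To make the dual-view clustering idea concrete, the sketch below shows one way candidate edits could be grouped by mixing semantic similarity with temporal locality. It is an illustration under stated assumptions, not the EvoTaxo implementation: the `DraftEdit` record, the linear mix controlled by `alpha`, the exponential time decay `tau`, the merge `threshold`, and the single-link union-find grouping are all hypothetical choices that the abstract does not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class DraftEdit:
    """Hypothetical record for one proposed taxonomy edit."""
    label: str              # proposed concept name
    embedding: list[float]  # semantic vector for the proposal (assumed given)
    window: int             # index of the time window the post fell in


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dual_view_similarity(a, b, alpha=0.7, tau=2.0):
    """Assumed scoring: weighted mix of semantic and temporal closeness."""
    semantic = cosine(a.embedding, b.embedding)
    temporal = math.exp(-abs(a.window - b.window) / tau)  # decays with window gap
    return alpha * semantic + (1 - alpha) * temporal


def cluster_edits(edits, threshold=0.8):
    """Single-link grouping via union-find: merge pairs scoring >= threshold."""
    parent = list(range(len(edits)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(edits)):
        for j in range(i + 1, len(edits)):
            if dual_view_similarity(edits[i], edits[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, edit in enumerate(edits):
        groups.setdefault(find(i), []).append(edit.label)
    return sorted(groups.values())


edits = [
    DraftEdit("legal aid", [1.0, 0.0], window=0),
    DraftEdit("legal help", [0.9, 0.1], window=1),
    DraftEdit("legal aid (later)", [1.0, 0.0], window=10),
    DraftEdit("protest logistics", [0.0, 1.0], window=0),
]
clusters = cluster_edits(edits)
```

With these toy values, the two near-synonymous proposals from adjacent windows fall into one cluster, while the semantically identical proposal from ten windows later stays separate. That is the intended effect of the temporal view in this sketch: evidence for an edit counts together only when it is both similar in meaning and close in time.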