EvoTaxo: Building and Evolving Taxonomy from Social Media Streams

📅 2026-03-20
🤖 AI Summary
This work addresses the difficulty of building taxonomies from social media streams, whose posts are short, noisy, semantically heterogeneous, and constantly evolving. It proposes EvoTaxo, a large language model (LLM)-based framework for building and evolving taxonomies over time. The approach converts each post into a structured draft edit against the current taxonomy, accumulates evidence over temporal windows, and consolidates candidate updates through a dual-view clustering strategy that combines semantic similarity with temporal locality. A refinement-and-arbitration step selects reliable edits before they are applied, and a concept memory bank at each node keeps semantic boundaries stable across revisions. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment and better corpus coverage at comparable taxonomy size. A case study on /r/ICE_Raids shows that it captures temporal shifts in discourse topics.

📝 Abstract
Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
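
The dual-view idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the semantic and temporal views are combined, so the weighted sum, the exponential decay over window distance, the greedy clustering loop, and every name below (`DraftEdit`, `dual_view_similarity`, `cluster_edits`, `alpha`, `decay`, `threshold`) are assumptions chosen only to make the concept concrete.

```python
import math
from dataclasses import dataclass


@dataclass
class DraftEdit:
    """Hypothetical structured draft action derived from a single post."""
    action: str        # e.g. "add_child" (illustrative label, not from the paper)
    target: str        # taxonomy node the draft edit applies to
    embedding: tuple   # toy semantic vector for the proposed concept
    window: int        # index of the time window the post falls in


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dual_view_similarity(a, b, alpha=0.5, decay=2.0):
    """Blend a semantic view and a temporal view (assumed weighted sum)."""
    semantic = cosine(a.embedding, b.embedding)
    # Temporal locality: edits in nearby windows score close to 1.
    temporal = math.exp(-abs(a.window - b.window) / decay)
    return alpha * semantic + (1 - alpha) * temporal


def cluster_edits(edits, threshold=0.7):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for edit in edits:
        for cluster in clusters:
            if dual_view_similarity(edit, cluster[0]) >= threshold:
                cluster.append(edit)
                break
        else:
            clusters.append([edit])
    return clusters


a = DraftEdit("add_child", "root", (1.0, 0.0), window=0)
b = DraftEdit("add_child", "root", (0.9, 0.1), window=1)   # similar and nearby
c = DraftEdit("add_child", "root", (0.0, 1.0), window=5)   # different concept
d = DraftEdit("add_child", "root", (1.0, 0.0), window=10)  # same concept, much later

clusters = cluster_edits([a, b, c, d])
```

With these toy values, `a` and `b` fall into one cluster, while `c` (semantically different) and `d` (semantically identical to `a` but ten windows later) each form their own. The last case is the point of the temporal view: a concept that resurfaces much later is treated as separate evidence rather than merged with an old burst.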
Problem

Research questions and friction points this paper is trying to address.

taxonomy induction
social media streams
temporal dynamics
noisy short texts
evolving discourse
Innovation

Methods, ideas, or system contributions that make the work stand out.

taxonomy induction
LLM-based framework
temporal dynamics
dual-view clustering
concept memory bank