Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses common limitations of unsupervised text clustering—semantically incoherent, redundant, and poorly interpretable clusters, together with the lack of effective validation mechanisms. The authors propose a three-stage reasoning framework that leverages large language models (LLMs) to semantically validate and reconstruct any given clustering result without requiring labeled data. The framework sequentially performs coherence checking, redundancy adjudication, and unsupervised label generation. By treating the LLM as a semantic adjudicator rather than an embedding generator, the approach decouples representation learning from structural validation. Experiments on two real-world social media datasets show that the method substantially improves cluster coherence and the human alignment of generated labels, with manual evaluations strongly endorsing label quality and cross-platform robustness.
📝 Abstract
Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.
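The three reasoning stages described in the abstract can be sketched as a simple refinement loop. This is an illustrative reconstruction, not the paper's implementation: the `judge` callable, the prompt wording, and all function names are assumptions standing in for whatever LLM interface and prompts the authors actually use.

```python
# Minimal sketch of a three-stage LLM-based cluster refinement loop:
# (1) coherence verification, (2) redundancy adjudication, (3) label grounding.
# `judge` is a hypothetical callable wrapping an LLM prompt -> text response.
from typing import Callable, Dict, List

def refine_clusters(
    clusters: Dict[str, List[str]],
    judge: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Validate and restructure clusters via three LLM reasoning stages."""
    # Stage 1: coherence verification -- keep only clusters whose member
    # texts the judge deems to share a single coherent theme.
    coherent = {
        cid: texts
        for cid, texts in clusters.items()
        if judge(f"Do these texts share one coherent theme? {texts}")
        .strip().lower().startswith("yes")
    }

    # Stage 2: redundancy adjudication -- merge a cluster into an existing
    # one when the judge finds the two semantically overlapping.
    merged: Dict[str, List[str]] = {}
    for cid, texts in coherent.items():
        target = next(
            (mid for mid, mtexts in merged.items()
             if judge(f"Same topic? A: {mtexts} B: {texts}")
             .strip().lower().startswith("yes")),
            None,
        )
        if target is not None:
            merged[target].extend(texts)
        else:
            merged[cid] = texts

    # Stage 3: label grounding -- ask the judge for a short interpretable
    # label per surviving cluster, with no labeled data involved.
    return {
        judge(f"Give a short topic label for: {texts}").strip(): texts
        for texts in merged.values()
    }
```

Because the judge is injected as a plain callable, the same loop can wrap any LLM backend or a deterministic stub for testing, which mirrors the paper's point that the framework is agnostic to the upstream clustering algorithm.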
Problem

Research questions and friction points this paper is trying to address.

unsupervised text clustering
cluster coherence
semantic redundancy
label interpretability
validation without supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning-based refinement
large language models
unsupervised text clustering
semantic validation
interpretable labeling