Balancing Complexity and Informativeness in LLM-Based Clustering: Finding the Goldilocks Zone

📅 2025-04-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of determining the optimal number of clusters in LLM-driven short-text clustering. We propose a โ€œgolden intervalโ€ of 16โ€“22 clusters, balancing semantic richness and interpretability. Grounded in linguistic communication efficiency theory, we design a multidimensional evaluation framework integrating semantic density, information entropy, and classification accuracy. Our method employs LLMs to generate semantic embeddings and cluster labels, applies Gaussian Mixture Model (GMM) clustering, and incorporates logistic regression for attribution analysis alongside generative interpretability assessment. Experiments on biographical text data demonstrate that this interval achieves the best trade-off between semantic discriminability and label interpretability; interpretability degrades markedly beyond 22 clusters, while GMM significantly outperforms baselines in semantic density. To our knowledge, this is the first work to systematically apply cognitive linguistic efficiency principles to optimize granularity in unsupervised text clustering.
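A minimal sketch of the embed-then-cluster step described above, assuming a sentence-transformers model as a stand-in for the paper's unspecified LLM embedder; the `cluster_bios` function name, model choice, and GMM settings are illustrative, not the authors' exact setup.

```python
# Minimal sketch: embed short texts, then fit a Gaussian Mixture Model.
# The embedding model is a hypothetical stand-in for the paper's LLM embedder.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture

def cluster_bios(bios: list[str], k: int = 16,
                 model_name: str = "all-MiniLM-L6-v2"):
    """Embed bios and fit a k-component GMM; returns (embeddings, labels).

    k defaults to the lower end of the reported 16-22 interval.
    """
    embedder = SentenceTransformer(model_name)  # stand-in embedder (assumption)
    X = np.asarray(embedder.encode(bios, normalize_embeddings=True))
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0)
    return X, gmm.fit_predict(X)
```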

๐Ÿ“ Abstract
The challenge of clustering short text data lies in balancing informativeness with interpretability. Traditional evaluation metrics often overlook this trade-off. Inspired by linguistic principles of communicative efficiency, this paper investigates the optimal number of clusters by quantifying the trade-off between informativeness and cognitive simplicity. We use large language models (LLMs) to generate cluster names and evaluate their effectiveness through semantic density, information theory, and clustering accuracy. Our results show that Gaussian Mixture Model (GMM) clustering on embeddings generated by an LLM increases semantic density compared to random assignment, effectively grouping similar bios. However, as the number of clusters increases, interpretability declines, as measured by a generative LLM's ability to correctly assign bios based on cluster names. A logistic regression analysis confirms that classification accuracy depends on the semantic similarity between bios and their assigned cluster names, as well as their distinction from alternatives. These findings reveal a "Goldilocks zone" where clusters remain distinct yet interpretable. We identify an optimal range of 16-22 clusters, paralleling linguistic efficiency in lexical categorization. These insights inform both theoretical models and practical applications, guiding future research toward optimising cluster interpretability and usefulness.
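The logistic-regression attribution described in the abstract could look like the following sketch, where the two features (similarity to the assigned cluster name, margin over the best alternative) and the `attribution_features` helper are assumptions based on the abstract's wording, not the paper's exact specification.

```python
# Hedged sketch: logistic regression relating LLM classification success to
# two similarity features implied by the abstract. attribution_features is a
# hypothetical helper, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def attribution_features(bio_embs, name_embs, assigned):
    """bio_embs: (n, d) unit-norm bio embeddings; name_embs: (c, d) unit-norm
    cluster-name embeddings; assigned: (n,) index of each bio's cluster."""
    sims = bio_embs @ name_embs.T                      # cosine similarities
    rows = np.arange(len(assigned))
    own = sims[rows, assigned]                         # similarity to own cluster name
    others = sims.copy()
    others[rows, assigned] = -np.inf
    margin = own - others.max(axis=1)                  # distinction from best alternative
    return np.column_stack([own, margin])

# correct: (n,) binary outcomes, 1 if the generative LLM assigned the bio to
# its true cluster. Fitting shows how much each feature drives accuracy:
# model = LogisticRegression().fit(attribution_features(B, N, y), correct)
```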
Problem

Research questions and friction points this paper is trying to address.

Balancing informativeness and interpretability in text clustering
Determining the optimal cluster count using semantic metrics (see the sketch after this list)
Evaluating LLM-generated cluster names for effectiveness
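One plausible way to operationalize the cluster-count sweep from the second item above: score each candidate k by within-cluster semantic density (mean cosine similarity to the cluster centroid) and the Shannon entropy of cluster sizes. Both metric definitions here are assumptions; the paper's exact formulations may differ.

```python
# Illustrative sweep over candidate cluster counts k.
import numpy as np
from scipy.stats import entropy
from sklearn.mixture import GaussianMixture

def semantic_density(X, labels):
    """Mean cosine similarity of unit-norm rows of X to their cluster centroid."""
    sims = []
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        sims.extend(members @ centroid)
    return float(np.mean(sims))

def sweep_k(X, ks=range(10, 31)):
    """Yield (k, density, size_entropy) for each candidate cluster count."""
    for k in ks:
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
        sizes = np.bincount(labels, minlength=k)
        yield k, semantic_density(X, labels), float(entropy(sizes / sizes.sum()))
```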
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an LLM for cluster name generation (sketched below)
Employs Gaussian Mixture Model on embeddings
Identifies optimal 16-22 cluster range
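A hedged sketch of the naming step from the first item above, using the OpenAI chat client as one possible interface; the model, prompt wording, and sampling of bios per cluster are illustrative choices, not the paper's specification.

```python
# Hedged sketch: prompt a chat LLM with sample bios from one cluster and ask
# for a short label. Model name and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def name_cluster(sample_bios: list[str], model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Here are short user bios from one cluster:\n"
        + "\n".join(f"- {b}" for b in sample_bios)
        + "\nReply with a concise 2-4 word category name for this cluster."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```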
Justin Miller
MIT, Ford Motor Company
Robotics · Artificial Intelligence · Machine Learning
Tristram Alexander
School of Physics, University of Sydney