Cost-Effective Text Clustering with Large Language Models

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high API invocation costs and limited query budgets of LLM-based text clustering, this paper proposes TECL, an unsupervised clustering framework. TECL introduces two strategies, EdgeLLM and TriangleLLM, which actively select informative text pairs and triplets, respectively, to send to the LLM. Customized prompts then extract must-link and cannot-link constraints from the LLM responses. By integrating active learning, constraint-based clustering, prompt engineering, and weighted graph optimization, TECL uses its scarce LLM queries to guide clustering. Under a strict query budget, TECL is reported to achieve average gains of 12.6% in both Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) over state-of-the-art methods across multiple benchmark datasets.

📝 Abstract
Text clustering aims to automatically partition a collection of text documents into distinct clusters based on linguistic features. In the literature, this task is usually framed either as metric clustering over text embeddings from pre-trained encoders or as graph clustering over pairwise similarities obtained from an oracle, e.g., a large ML model. Recently, large language models (LLMs) have brought significant advances to this field by offering contextualized text embeddings and highly accurate similarity scores, but they also incur substantial computational and/or financial overhead from the numerous API-based queries or inference calls they require. In response, this paper proposes TECL, a cost-effective framework that taps into feedback from LLMs for accurate text clustering within a limited budget of LLM queries. Under the hood, TECL adopts our EdgeLLM or TriangleLLM to construct must-link/cannot-link constraints for text pairs, and then feeds these constraints as supervision signals into our weighted constrained clustering approach to generate clusters. In particular, EdgeLLM (resp. TriangleLLM) identifies informative text pairs (resp. triplets) for querying LLMs via greedy algorithms and extracts pairwise constraints accurately through carefully crafted prompts. Our experiments on multiple benchmark datasets show that TECL consistently and considerably outperforms existing solutions in unsupervised text clustering under the same LLM query cost.
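
The idea of spending a limited query budget on informative pairs and turning the answers into must-link/cannot-link constraints can be illustrated with a minimal Python sketch. It is not the paper's EdgeLLM algorithm: the selection rule (pick the pairs whose cosine similarity is closest to an assumed threshold of 0.5) is a simple uncertainty heuristic chosen for illustration, and `oracle` is a hypothetical callable standing in for a prompted LLM call that answers whether two texts belong to the same cluster.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_uncertain_pairs(embeddings, budget, threshold=0.5):
    """Return the `budget` index pairs whose similarity is closest to `threshold`.

    Assumption: pairs near the threshold are the ones an embedding-only
    method is least sure about, so they are the most worth asking about.
    """
    pairs = combinations(range(len(embeddings)), 2)
    ranked = sorted(
        pairs,
        key=lambda p: abs(cosine(embeddings[p[0]], embeddings[p[1]]) - threshold),
    )
    return ranked[:budget]


def collect_constraints(texts, embeddings, oracle, budget):
    """Query `oracle` on the selected pairs and split answers into constraint sets."""
    must_link, cannot_link = set(), set()
    for i, j in select_uncertain_pairs(embeddings, budget):
        # `oracle` is a hypothetical stand-in for one LLM query.
        if oracle(texts[i], texts[j]):
            must_link.add((i, j))
        else:
            cannot_link.add((i, j))
    return must_link, cannot_link
```

The number of oracle calls never exceeds `budget`, which is the property that matters when each call is a paid API request. The resulting constraint sets would then be passed to a constrained clustering routine.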
Problem

Research questions and friction points this paper is trying to address.

Cost-effective text clustering using LLMs with limited queries
Reducing computational and financial overhead in LLM-based clustering
Improving clustering accuracy via LLM feedback and constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

EdgeLLM selects informative text pairs for LLM queries
TriangleLLM selects informative text triplets for LLM queries
Weighted constrained clustering turns the extracted must-link/cannot-link constraints into clusters
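
One reason pairwise constraints are an economical form of supervision is that they propagate: if A and B must link and B and C cannot link, then A and C cannot link, with no further query needed. The Python sketch below shows this generic propagation with a union-find structure. It is a standard technique from constrained clustering, not the paper's weighted clustering method, and the function names `propagate` and `implied_relation` are hypothetical.

```python
def propagate(n, must_link, cannot_link):
    """Merge must-linked items into groups and record which groups are separated.

    Returns (groups, separated): groups[i] is the group label of item i, and
    separated is a set of frozensets of group labels that cannot be merged.
    Raises ValueError if a cannot-link pair falls inside one must-link group.
    """
    parent = list(range(n))

    def find(x):
        # Path-halving find for the union-find structure.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in must_link:
        parent[find(i)] = find(j)

    groups = [find(i) for i in range(n)]
    separated = set()
    for i, j in cannot_link:
        if groups[i] == groups[j]:
            raise ValueError(f"conflicting constraints on pair ({i}, {j})")
        separated.add(frozenset((groups[i], groups[j])))
    return groups, separated


def implied_relation(i, j, groups, separated):
    """Report what the known constraints imply about the pair (i, j)."""
    if groups[i] == groups[j]:
        return "must-link"
    if frozenset((groups[i], groups[j])) in separated:
        return "cannot-link"
    return "unknown"
```

Pairs that come back as "unknown" are the only ones for which a new LLM query could add information. The conflict check matters in practice because LLM answers can contradict each other.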