Optimized Algorithms for Text Clustering with LLM-Generated Constraints

📅 2026-01-16
🤖 AI Summary
This work addresses the high annotation cost and noise sensitivity of traditional constrained text clustering methods that rely on pairwise must-link/cannot-link constraints. Large language models (LLMs) can generate such constraints automatically, but their practical use is hindered by the large number of queries required and by noisy outputs. To overcome these limitations, the authors propose an LLM-driven clustering approach that replaces pairwise constraints with set-level constraints, reducing the number of LLM invocations. They further design a clustering algorithm tailored to the characteristics of LLM-generated constraints, which uses a confidence threshold and a penalty mechanism to limit the impact of inaccurate constraints. Experiments on five text datasets show that the proposed method achieves clustering accuracy comparable to state-of-the-art approaches while reducing the number of LLM queries by a factor of more than 20.
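To illustrate why set-level constraints can carry more information per query than pairwise ones, note that a must-link set of k texts implies k(k-1)/2 pairwise must-link constraints. The sketch below only demonstrates that counting argument; it is not the authors' constraint-generation procedure, and the function names and document IDs are hypothetical.

```python
from itertools import combinations


def expand_must_link_set(items):
    """Return every pairwise must-link constraint implied by one must-link set."""
    return list(combinations(sorted(items), 2))


def pairwise_count(set_size):
    """Number of pairwise constraints implied by a set of the given size."""
    return set_size * (set_size - 1) // 2


# Hypothetical example: a single LLM response groups five document IDs together.
must_link_set = {"doc3", "doc7", "doc8", "doc12", "doc19"}
pairs = expand_must_link_set(must_link_set)
print(len(pairs), pairwise_count(len(must_link_set)))  # prints: 10 10
```

Under the assumption that a pairwise scheme spends one query per pair, a single five-item set stands in for ten such queries, and the gap grows quadratically with the set size.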

📝 Abstract
Clustering is a fundamental tool that has garnered significant interest across a wide range of applications including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality through LLM-based automatic constraint generation. In this paper, we propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than using traditional pairwise constraints. This approach improves both query efficiency and constraint accuracy compared to state-of-the-art methods. We further introduce a constrained clustering algorithm tailored to the characteristics of LLM-generated constraints. Our method incorporates a confidence threshold and a penalty mechanism to address potentially inaccurate constraints. We evaluate our approach on five text datasets, considering both the cost of constraint generation and the overall clustering performance. The results show that our method achieves clustering accuracy comparable to the state-of-the-art algorithms while reducing the number of LLM queries by more than 20 times.
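The abstract does not spell out how the confidence threshold and penalty mechanism operate, so the following is only a minimal sketch of one generic way such a scheme could work: a soft-constraint assignment step in which low-confidence constraints are ignored and violating a retained constraint adds a cost. The function name, the `(index, confidence)` constraint format, and the default `threshold` and `penalty` values are all assumptions made for illustration, not details from the paper.

```python
import numpy as np


def assign_with_soft_constraints(point, centers, labels, must_link, cannot_link,
                                 threshold=0.7, penalty=1.0):
    """Pick a cluster for `point` using distance plus constraint-violation costs.

    must_link / cannot_link: lists of (other_index, confidence) for this point.
    labels[other_index] is the partner's cluster, or -1 if it is unassigned.
    Constraints with confidence below `threshold` are ignored (assumed rule).
    """
    costs = np.sum((centers - point) ** 2, axis=1).astype(float)
    cluster_ids = np.arange(len(centers))
    for other, conf in must_link:
        if conf >= threshold and labels[other] >= 0:
            # Every cluster except the partner's would violate the must-link.
            costs[cluster_ids != labels[other]] += penalty * conf
    for other, conf in cannot_link:
        if conf >= threshold and labels[other] >= 0:
            # Only the partner's cluster would violate the cannot-link.
            costs[labels[other]] += penalty * conf
    return int(np.argmin(costs))


centers = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = np.array([0, -1])          # point 0 is in cluster 0, point 1 is unassigned
point = np.array([0.4, 0.0])        # slightly closer to cluster 0

unconstrained = assign_with_soft_constraints(point, centers, labels, [], [])
confident = assign_with_soft_constraints(point, centers, labels, [], [(0, 0.9)])
doubtful = assign_with_soft_constraints(point, centers, labels, [], [(0, 0.5)])
```

In this toy setup the confident cannot-link (0.9) pushes the point away from its nearest cluster, while the doubtful one (0.5) falls below the threshold and has no effect. Because violations are penalized rather than forbidden, a wrong constraint can still be overridden when the distance evidence is strong enough.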
Problem

Research questions and friction points this paper is trying to address.

text clustering
large language models
constraint generation
must-link constraints
cannot-link constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

constraint generation
text clustering
large language models
constrained clustering
query efficiency
Chaoqi Jia
School of Accounting, Information Systems and Supply Chain, RMIT University, Melbourne, VIC 3000, Australia
Weihong Wu
School of Mathematics and Statistics, Fuzhou University, Fuzhou 350116, China
Longkun Guo
Fuzhou University
Algorithm design and analysis, data science, scheduling, mobile networks
Zhigang Lu
Western Sydney University, NSW 2751, Australia
Chao Chen
RMIT University
AI-Driven Cybersecurity, AI Safety, AI and Analytics for Business
Kok-Leong Ong
School of Accounting, Information Systems and Supply Chain, RMIT University, Melbourne, VIC 3000, Australia