ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering

📅 2025-07-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current topic modeling and document clustering evaluations face two key bottlenecks: automated metrics that align poorly with human judgment, and costly, labor-intensive manual annotation that does not scale. To address these, the authors propose a use-oriented human evaluation protocol and a scalable automated approximation: both human annotators and large language model (LLM) proxies infer a category from a group of texts and then generalize that category to unseen documents, mirroring how practitioners actually use these models. Through large-scale crowdsourced experiments on two benchmark datasets, they show that the best-performing LLM proxy is statistically indistinguishable from a human annotator (p > 0.05) and can therefore serve as a high-fidelity, low-cost surrogate for human assessment. This work establishes a reproducible, scalable protocol for evaluating unsupervised text analysis.

📝 Abstract
Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann
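The protocol in the abstract can be sketched in code. The helper names below are illustrative, not from the ProxAnn package: `infer_category` stands in for the human/LLM step of reading a topic's top documents and naming a category, and `fits_category` stands in for judging whether an unseen document matches it. Here both steps are replaced by trivial keyword heuristics so the sketch is self-contained and runnable.

```python
# Minimal sketch of the use-oriented evaluation protocol (hypothetical
# helper names; the real steps are performed by annotators or an LLM proxy).

def infer_category(topic_docs):
    # Stand-in for the category-inference step: pick the most frequent
    # word across the topic's top documents as the inferred category.
    counts = {}
    for doc in topic_docs:
        for word in doc.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return max(counts, key=counts.get)

def fits_category(category, doc):
    # Stand-in for the generalization step: does the unseen
    # document match the inferred category?
    return category in doc.lower().split()

def protocol_score(topic_docs, held_out_docs):
    # Fraction of held-out documents judged to fit the inferred
    # category: a use-oriented quality signal for one topic/cluster.
    category = infer_category(topic_docs)
    fits = [fits_category(category, d) for d in held_out_docs]
    return sum(fits) / len(fits)

topic = ["soccer match result", "soccer league table", "soccer cup final"]
held_out = ["soccer transfer news", "stock market report"]
score = protocol_score(topic, held_out)  # 0.5: one of two docs fits
```

A topic whose inferred category generalizes to many held-out documents scores high; an incoherent topic yields a category that few unseen documents fit.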
Problem

Research questions and friction points this paper is trying to address.

Evaluating topic models without scalable human feedback
Aligning automated metrics with human preferences in clustering
Developing LLM proxies to replace costly human annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable human evaluation protocol for models
LLM-based proxy mimics human annotators
Automated proxies validated via crowdworker annotations
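The validation idea behind the last bullet can be sketched as follows: treat the LLM proxy as one more annotator and check whether its agreement with the human annotators falls within the range of human-human agreement. This is an illustrative simplification, not the paper's exact statistical test, and all names and the pass criterion below are assumptions.

```python
# Hypothetical sketch: is the LLM proxy's agreement with humans
# within the range of human-human agreement?

def agreement(ratings_a, ratings_b):
    # Fraction of items on which two annotators give the same rating.
    same = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return same / len(ratings_a)

def proxy_is_indistinguishable(human_ratings, proxy_ratings):
    # human_ratings: one rating list per human annotator (illustrative).
    # Compare each human against every other human, then the proxy
    # against all humans.
    human_scores = []
    for i, h in enumerate(human_ratings):
        others = [r for j, r in enumerate(human_ratings) if j != i]
        human_scores.extend(agreement(h, o) for o in others)
    proxy_scores = [agreement(proxy_ratings, h) for h in human_ratings]
    # Simplified criterion: the proxy's mean agreement must be at least
    # as high as the lowest human-human agreement.
    return sum(proxy_scores) / len(proxy_scores) >= min(human_scores)
```

The paper itself uses crowdworker annotations and a significance test (p > 0.05); this sketch only conveys the shape of the comparison.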