🤖 AI Summary
Topic modeling and document clustering evaluation faces two key bottlenecks: poor alignment between automated metrics and human judgment, and reliance on costly, labor-intensive manual annotation that hinders scalability. To address these, the authors propose a pragmatically oriented human evaluation protocol and a scalable automated approximation: both humans and large language model (LLM) agents infer a category from a group of texts and then generalize that category to unseen documents, mirroring real-world usage. Through large-scale crowdsourced experiments on two benchmark datasets, they show that the best-performing LLM agent matches human-level performance in topic modeling and clustering evaluation, with no statistically significant difference (p > 0.05), and can therefore serve as a high-fidelity, low-cost surrogate for human assessment. This work establishes a reproducible, scalable benchmark for evaluating unsupervised text analysis.
📝 Abstract
Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann
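The two-step protocol described above (category inference over a group of texts, then generalization to held-out documents) can be sketched in code. This is a minimal illustration, not the paper's implementation: `infer_category` and `fits_category` are hypothetical stand-ins for the human annotator or LLM-proxy calls, and the keyword heuristics in the demo exist only to make the sketch runnable.

```python
# Hypothetical sketch of the protocol: an annotator (human or LLM proxy)
# reads example documents assigned to a topic/cluster, infers a category,
# then judges whether held-out documents fit that category. Agreement with
# gold labels scores the underlying topic model or clustering.
from typing import Callable, Sequence


def evaluate_cluster(
    examples: Sequence[str],
    heldout: Sequence[tuple[str, bool]],  # (document, gold "belongs" label)
    infer_category: Callable[[Sequence[str]], str],
    fits_category: Callable[[str, str], bool],
) -> float:
    """Fraction of held-out documents whose fit judgment matches the gold label."""
    category = infer_category(examples)  # step 1: category inference
    correct = sum(
        fits_category(category, doc) == gold  # step 2: generalization
        for doc, gold in heldout
    )
    return correct / len(heldout)


if __name__ == "__main__":
    # Toy example with keyword-based stand-ins for the annotator/LLM.
    examples = ["stock markets rallied", "bond yields fell", "earnings beat forecasts"]
    heldout = [("central bank raises rates", True), ("new dinosaur fossil found", False)]
    infer = lambda docs: "finance"
    fits = lambda cat, doc: any(w in doc for w in ("bank", "rates", "stock", "earnings"))
    print(evaluate_cluster(examples, heldout, infer, fits))  # 1.0
```

In the actual protocol, the annotator's category and fit judgments would come from crowdworkers or an LLM prompt, and scores would be aggregated over many topics and models rather than a single cluster.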