🤖 AI Summary
Current LLM-based multi-document topic extraction lacks evaluation methodologies tailored to LLM outputs, resulting in low inter-annotator agreement (IAA) and hindering trustworthy deployment. To address this, we propose $T^5Score$, a decomposable, quantifiable, and consistent framework for topic set evaluation: it disentangles topic quality into annotatable dimensions (semantic coverage, coherence, and discriminability), introduces a lightweight semantic-decomposition annotation protocol coupled with a multi-dimensional quantitative scoring mechanism, and supports human, automated, and hybrid evaluation. Empirical validation across multiple benchmark datasets shows substantially higher IAA than evaluation with conventional metrics such as F1 and NMI, along with strong cross-dataset robustness. The framework establishes a reliable, reproducible foundation for evaluating LLM-generated topics, enabling rigorous, interpretable, and scalable assessment of topic extraction systems.
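To make the decomposition concrete, below is a minimal sketch of how per-dimension scores might be combined into a single topic-set score. The dimension names come from the summary above, but the [0, 1] score range, the equal-weight aggregation, and the `DimensionScores`/`aggregate` names are illustrative assumptions, not the paper's actual scoring mechanism.

```python
from dataclasses import dataclass

# Hypothetical per-topic-set scores on the three annotated dimensions,
# each assumed to be normalized to [0, 1]. The paper's actual scoring
# mechanism may differ; this only illustrates the decomposition idea.
@dataclass
class DimensionScores:
    coverage: float          # semantic coverage of the source documents
    coherence: float         # internal consistency of each topic
    discriminability: float  # separation between distinct topics

def aggregate(scores: DimensionScores,
              weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Combine dimension scores into one quality score (illustrative weighted mean)."""
    w_cov, w_coh, w_dis = weights
    return (w_cov * scores.coverage
            + w_coh * scores.coherence
            + w_dis * scores.discriminability)

print(aggregate(DimensionScores(coverage=0.82, coherence=0.74, discriminability=0.69)))
```

In practice, the weights would be set to reflect how the protocol balances the three dimensions; equal weighting is used here only for simplicity.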
📝 Abstract
Using LLMs for multi-document topic extraction has recently gained popularity due to the apparent high quality of their outputs, their expressiveness, and their ease of use. However, most existing evaluation practices are not designed for LLM-generated topics and result in low inter-annotator agreement scores, hindering the reliable use of LLMs for the task. To address this, we introduce $T^5Score$, an evaluation methodology that decomposes the quality of a topic set into quantifiable aspects, each measurable through easy-to-perform annotation tasks. This framing enables a convenient evaluation procedure, manual or automatic, that yields strong inter-annotator agreement. To substantiate our methodology and claims, we perform extensive experimentation on multiple datasets and report the results.
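As a rough illustration of how the claimed agreement could be measured, the sketch below computes mean pairwise Cohen's kappa over binary annotator judgments, one common IAA statistic. The abstract does not specify which agreement coefficient $T^5Score$ reports, and the annotator responses here are placeholder values.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Illustrative binary judgments (1 = "yes") from three annotators on the same
# ten decomposed annotation items. Real annotations would come from the
# protocol's per-dimension tasks; these values are placeholders.
annotations = {
    "A": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "B": [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    "C": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
}

# Average pairwise Cohen's kappa; the paper may use a different coefficient.
pairs = list(combinations(annotations, 2))
kappas = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
print(f"mean pairwise kappa: {sum(kappas) / len(kappas):.3f}")
```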