🤖 AI Summary
Current LLM-based multi-document topic extraction lacks evaluation methodologies tailored to LLM outputs, resulting in low inter-annotator agreement (IAA) and hindering trustworthy deployment. To address this, we propose $T^5Score$, a decomposable, quantifiable, and consistent framework for topic set evaluation: it disentangles topic quality into annotatable dimensions (semantic coverage, coherence, and discriminability), introduces a lightweight semantic-decomposition annotation protocol coupled with a multi-dimensional quantitative scoring mechanism, and supports human, automated, and hybrid evaluation. Empirical validation across multiple benchmark datasets shows substantially higher IAA than evaluation with conventional metrics such as F1 and NMI, along with strong cross-dataset robustness. The framework establishes a reliable, reproducible foundation for evaluating LLM-generated topics, enabling rigorous, interpretable, and scalable assessment of topic extraction systems.
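To make the decomposition concrete, below is a minimal sketch of how per-dimension scores might be combined into a single topic-set score. The dimension names come from the summary above, but the [0, 1] score range, the equal-weight aggregation, and the `DimensionScores`/`aggregate` names are illustrative assumptions, not the paper's actual scoring mechanism.

```python
from dataclasses import dataclass

# Hypothetical per-topic-set scores on the three annotated dimensions,
# each assumed to be normalized to [0, 1]. The paper's actual scoring
# mechanism may differ; this only illustrates the decomposition idea.
@dataclass
class DimensionScores:
    coverage: float          # semantic coverage of the source documents
    coherence: float         # internal consistency of each topic
    discriminability: float  # separation between distinct topics

def aggregate(scores: DimensionScores,
              weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Combine dimension scores into one quality score (illustrative weighted mean)."""
    w_cov, w_coh, w_dis = weights
    return (w_cov * scores.coverage
            + w_coh * scores.coherence
            + w_dis * scores.discriminability)

print(aggregate(DimensionScores(coverage=0.82, coherence=0.74, discriminability=0.69)))
```

In practice, the weights would be set to reflect how the protocol balances the three dimensions; equal weighting is used here only for simplicity.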
📝 Abstract
Using LLMs for multi-document topic extraction has recently gained popularity due to the apparent high quality of their outputs, their expressiveness, and their ease of use. However, most existing evaluation practices are not designed for LLM-generated topics and result in low inter-annotator agreement scores, hindering the reliable use of LLMs for the task. To address this, we introduce $T^5Score$, an evaluation methodology that decomposes the quality of a topic set into quantifiable aspects, each measurable through easy-to-perform annotation tasks. This framing enables a convenient evaluation procedure, manual or automatic, that yields strong inter-annotator agreement. To substantiate our methodology and claims, we perform extensive experimentation on multiple datasets and report the results.
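As a rough illustration of how the claimed agreement could be measured, the sketch below computes mean pairwise Cohen's kappa over binary annotator judgments, one common IAA statistic. The abstract does not specify which agreement coefficient $T^5Score$ reports, and the annotator responses here are placeholder values.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Illustrative binary judgments (1 = "yes") from three annotators on the same
# ten decomposed annotation items. Real annotations would come from the
# protocol's per-dimension tasks; these values are placeholders.
annotations = {
    "A": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "B": [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    "C": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
}

# Average pairwise Cohen's kappa; the paper may use a different coefficient.
pairs = list(combinations(annotations, 2))
kappas = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
print(f"mean pairwise kappa: {sum(kappas) / len(kappas):.3f}")
```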