Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores

📅 2024-03-01
🏛️ arXiv.org
📈 Citations: 26
Influential: 2
🤖 AI Summary
The lack of standardized, reproducible methods for quantifying textual diversity in large language model (LLM) outputs hinders rigorous evaluation of generation quality and comparisons across models or corpora. Method: The authors propose a systematic framework for text diversity evaluation, empirically testing the convergent validity of existing diversity metrics and identifying a minimal, complementary metric set: compression ratio (zlib/lz4), self-repetition rate of long n-grams, Self-BLEU, and BERTScore. These metrics exhibit low pairwise correlation, so each contributes distinct information. Contribution/Results: They release *diversity*, an open-source Python library for efficient computation and interactive visualization of repetition. Empirical analysis shows that lightweight compression-based metrics capture much of the same signal as computationally expensive n-gram homogeneity scores, making diversity assessment in LLM research more interpretable, comparable, and practical.

📝 Abstract
The diversity across outputs generated by LLMs shapes perception of their quality and utility. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and "canned" responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and we release diversity, an open-source Python package for measuring and extracting repetition in text. We also build a platform based on diversity for users to interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute n-gram overlap homogeneity scores. Further, a combination of measures (compression ratios, self-repetition of long n-grams, Self-BLEU, and BERTScore) is sufficient to report, as these have low mutual correlation with each other.
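The compression-ratio idea from the abstract is simple to illustrate: redundant corpora compress much better than varied ones, so the ratio of raw to compressed size acts as a fast homogeneity signal. The sketch below uses only the standard-library `zlib` module; the function name and exact formulation are illustrative, not the `diversity` package's API.

```python
import zlib

def compression_ratio(texts):
    """Ratio of original byte length to zlib-compressed byte length.

    A higher ratio means more redundancy across the texts, i.e. lower
    diversity. Illustrative sketch, not the `diversity` package's API.
    """
    data = " ".join(texts).encode("utf-8")
    return len(data) / len(zlib.compress(data))

# Templated, repetitive outputs compress far better than varied ones.
repetitive = ["The answer to your question is 42."] * 50
varied = [f"Response {i} uses different wording each time." for i in range(50)]
```

Calling `compression_ratio(repetitive)` yields a much larger value than `compression_ratio(varied)`, which is the intuition behind substituting compression for slower n-gram overlap scores.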
Problem

Research questions and friction points this paper is trying to address.

Text diversity measurement lacks a standard, universally accepted method
Identifying repetitive structures in large text corpora is challenging
The convergent validity of existing diversity scores has not been empirically evaluated
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source Python package for text diversity
Platform for interactive text repetition exploration
Combination of fast compression and n-gram measures
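To make the "self-repetition of long n-grams" measure concrete, here is a minimal sketch: for each text, count the fraction of its n-grams that also appear in any other text in the corpus. The function name and exact definition are assumptions for illustration; the released `diversity` package defines its own formulation.

```python
def long_ngram_self_repetition(texts, n=4):
    """Fraction of each text's n-grams that recur in some other text.

    Hypothetical sketch of the long n-gram self-repetition idea;
    not the exact metric implemented in the `diversity` package.
    """
    # Collect the set of word n-grams for each text.
    grams = []
    for t in texts:
        toks = t.split()
        grams.append(set(zip(*(toks[i:] for i in range(n)))))

    repeated, total = 0, 0
    for i, g in enumerate(grams):
        # n-grams appearing in any *other* text in the corpus.
        others = set().union(*(grams[:i] + grams[i + 1:])) if len(grams) > 1 else set()
        total += len(g)
        repeated += len(g & others)
    return repeated / total if total else 0.0
```

Two fully identical texts give a score of 1.0, fully distinct texts give 0.0, and templated LLM outputs fall somewhere in between; long n (e.g. 4+) targets verbatim "canned" phrases rather than incidental word overlap.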