🤖 AI Summary
Existing diversity evaluation in commonsense generation lacks reliable metrics; formal metrics (e.g., Distinct-n, Self-BLEU) overestimate diversity, while semantic metrics remain empirically unvalidated.
Method: We conduct the first meta-evaluation of diversity metrics for commonsense generation, constructing a semantic-level sentence diversity benchmark curated via LLM-assisted human annotation, and systematically comparing formal and semantic metrics against human- and LLM-derived diversity judgments.
Contribution/Results: Empirical results show that semantic metrics, particularly BERTScore- and STS-based similarity, correlate strongly (r > 0.8) with LLM-based diversity ratings, significantly outperforming formal metrics, which assign spuriously high scores even to random sentences. We thus establish semantic-aware evaluation as the superior paradigm, providing a reproducible, empirically grounded assessment standard for commonsense generation research.
📝 Abstract
In commonsense generation, given a set of input concepts, a model must generate a response that is not only commonsense-bearing but also captures multiple diverse viewpoints. Numerous evaluation metrics based on form- and content-level overlap have been proposed in prior work for evaluating the diversity of a commonsense generation model. However, it remains unclear which metrics are best suited for evaluating diversity in commonsense generation. To address this gap, we conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics consistently overestimate the diversity of sentence sets, assigning overly high diversity scores even to randomly generated sentences. We then use a Large Language Model (LLM) to create a novel dataset annotated for the diversity of sentences generated for a commonsense generation task, and use it to conduct a meta-evaluation of existing diversity evaluation metrics. Our experimental results show that content-based diversity evaluation metrics consistently outperform their form-based counterparts, showing high correlations with the LLM-based ratings. We recommend that future work on commonsense generation use content-based metrics for evaluating the diversity of its outputs.
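The overestimation problem with form-based metrics can be illustrated with a minimal sketch of Distinct-n, the ratio of unique n-grams to total n-grams in a sentence set. This is a hypothetical illustration, not the paper's experimental setup: sentences assembled from random vocabulary words share almost no n-grams, so Distinct-2 approaches 1.0 even though the sentences express no coherent, diverse viewpoints.

```python
import random

def distinct_n(sentences, n=2):
    """Distinct-n: unique n-grams / total n-grams across a sentence set.
    Higher values are conventionally read as higher diversity."""
    ngrams = []
    for s in sentences:
        toks = s.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Random "sentences" drawn from a synthetic vocabulary (an assumption for
# illustration): near-zero n-gram overlap pushes Distinct-2 toward 1.0.
random.seed(0)
vocab = [f"w{i}" for i in range(1000)]
random_sents = [" ".join(random.sample(vocab, 10)) for _ in range(20)]
print(distinct_n(random_sents, n=2))  # close to 1.0
```

Content-based metrics such as BERTScore avoid this failure mode because random word sequences are not semantically similar or dissimilar in any meaningful way, so they do not receive inflated diversity scores from surface-form mismatch alone.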