No Gold Standard, No Problem: Reference-Free Evaluation of Taxonomies

📅 2025-05-16

🤖 AI Summary

This work addresses taxonomy quality assessment when no gold-standard annotations are available. We propose two reference-free evaluation metrics: (1) a robustness measure based on the correlation between semantic similarity and hierarchical distance, designed to detect semantic-structural inconsistencies; and (2) a logical adequacy assessment grounded in natural language inference (NLI), which verifies entailment relationships between parent and child concepts. To our knowledge, this is the first approach to enable fully unsupervised, quantitative taxonomy quality evaluation, addressing gaps in existing methods around semantic-structural alignment and logical consistency validation. Experiments across five real-world taxonomies show that both metrics correlate strongly with gold-standard F1 scores (Spearman ρ > 0.85), which supports the reliability and interpretability of unsupervised taxonomy evaluation.

📝 Abstract

We introduce two reference-free metrics for quality evaluation of taxonomies. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, covering a type of error not handled by existing metrics. The second uses Natural Language Inference to assess logical adequacy. Both metrics are tested on five taxonomies and are shown to correlate well with F1 against gold-standard taxonomies.

Problem

Research questions and friction points this paper is trying to address.

Evaluating taxonomy quality without gold standards

Measuring robustness via semantic-taxonomic correlation

Assessing logical adequacy using Natural Language Inference
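
To make the semantic-taxonomic correlation idea concrete, here is a minimal sketch in Python. It is not the paper's implementation. Several choices are assumptions made for illustration: the taxonomy is a child-to-parent dictionary, taxonomic similarity is taken as 1 / (1 + tree distance), semantic similarity is cosine similarity over hand-made toy vectors (a real setup would use embeddings from a language model), and the correlation is Spearman's rank correlation.

```python
from itertools import combinations
from math import sqrt

def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges between a and b through their lowest common ancestor."""
    depth_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in depth_a:
            return depth_a[n] + j
    raise ValueError("nodes are not in the same tree")

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def semantic_taxonomic_correlation(parent, embeddings):
    """Correlate semantic and taxonomic similarity over all concept pairs."""
    sem, tax = [], []
    for a, b in combinations(sorted(embeddings), 2):
        sem.append(cosine(embeddings[a], embeddings[b]))
        # Assumption: similarity decays with tree distance as 1 / (1 + d).
        tax.append(1.0 / (1.0 + tree_distance(a, b, parent)))
    return spearman(sem, tax)

# Hypothetical toy vectors, chosen by hand; not real embeddings.
toy_embeddings = {
    "animal": (1.0, 1.0), "mammal": (1.0, 0.5), "bird": (0.5, 1.0),
    "dog": (1.0, 0.2), "cat": (1.0, 0.3), "sparrow": (0.2, 1.0),
}
good_taxonomy = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
                 "sparrow": "bird", "bird": "animal"}
# Same concepts, but "dog" and "sparrow" are attached to the wrong parents.
bad_taxonomy = {"dog": "bird", "cat": "mammal", "mammal": "animal",
                "sparrow": "mammal", "bird": "animal"}

good_score = semantic_taxonomic_correlation(good_taxonomy, toy_embeddings)
bad_score = semantic_taxonomic_correlation(bad_taxonomy, toy_embeddings)
print(f"good: {good_score:.3f}  bad: {bad_score:.3f}")
```

On this toy data the well-formed taxonomy scores higher than the one with misplaced concepts, which is the kind of semantic-structural mismatch such a metric is meant to surface. No gold-standard taxonomy is needed at any step.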

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-free metrics for taxonomy quality evaluation

Semantic-taxonomic correlation for robustness assessment

Natural Language Inference for logical adequacy
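
The NLI-based check can be sketched in a similar spirit. The code below is an illustrative assumption, not the paper's formulation: each parent-child edge is turned into a premise ("This is a dog.") and a hypothesis ("This is a mammal."), and the score is the fraction of edges for which an entailment judge says the premise entails the hypothesis. The judge is passed in as a callable; `toy_entails` is a hypothetical lookup-table stand-in so the example runs without a model, and in practice it would be replaced by a real NLI classifier.

```python
from typing import Callable, Dict

PREFIX = "This is a "

def logical_adequacy(parent: Dict[str, str],
                     entails: Callable[[str, str], bool]) -> float:
    """Fraction of child -> parent edges judged as entailments."""
    edges = list(parent.items())
    if not edges:
        return 0.0
    hits = sum(
        1 for child, par in edges
        if entails(f"{PREFIX}{child}.", f"{PREFIX}{par}.")
    )
    return hits / len(edges)

# Hypothetical stand-in for an NLI model: a fixed table of true is-a facts.
KNOWN_IS_A = {
    ("dog", "mammal"), ("cat", "mammal"), ("mammal", "animal"),
    ("sparrow", "bird"), ("bird", "animal"),
}

def toy_entails(premise: str, hypothesis: str) -> bool:
    """Pretend NLI judge: looks the concept pair up in KNOWN_IS_A."""
    child = premise[len(PREFIX):-1]       # strip template prefix and final "."
    par = hypothesis[len(PREFIX):-1]
    return (child, par) in KNOWN_IS_A

good_taxonomy = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
                 "sparrow": "bird", "bird": "animal"}
# Two edges are wrong here: dog -> bird and sparrow -> mammal.
bad_taxonomy = {"dog": "bird", "cat": "mammal", "mammal": "animal",
                "sparrow": "mammal", "bird": "animal"}

print(logical_adequacy(good_taxonomy, toy_entails))
print(logical_adequacy(bad_taxonomy, toy_entails))
```

Keeping the entailment judge as a parameter separates the scoring logic from the choice of NLI model, so the same function works with any classifier that can answer "does this premise entail this hypothesis?".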
