🤖 AI Summary
Existing text generation evaluation metrics, such as BLEU and BERTScore, struggle to effectively assess semantic fidelity and often overlook critical errors like content omissions or factual inconsistencies. This work proposes a reference-free, multidimensional evaluation framework that introduces a cross-examination mechanism into generation assessment for the first time: treating the source and generated texts as independent knowledge bases, it performs question-answer–based mutual validation by generating verifiable questions from each to interrogate the other. The method yields three interpretable scores—coverage, consistency, and conformity—without requiring reference texts, enabling precise detection of semantic distortions. It significantly outperforms conventional metrics across translation, summarization, and clinical note tasks, effectively capturing errors at the entity and relational levels. Its reference-free and reference-based modes exhibit strong correlation, and expert validation confirms that mismatched questions align closely with actual semantic errors.
📝 Abstract
Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for reference-free, multi-dimensional evaluation by treating the source and candidate texts as independent knowledge bases. CEF generates verifiable questions from each text and performs a cross-examination to derive three interpretable scores: Coverage, Conformity, and Consistency. Validated across translation, summarization, and clinical note generation, our framework identifies critical errors, such as content omissions and factual contradictions, that standard metrics miss. A key contribution is a systematic robustness analysis for selecting a stable judge model. Crucially, the strong correlation between our reference-free and with-reference modes validates CEF's reliability in the absence of gold references. Furthermore, human expert validation demonstrates that CEF's mismatched questions align more strongly with meaning-altering semantic errors than with non-semantic errors, and that the framework particularly excels at identifying entity-based and relational distortions.
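The cross-examination loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the question/answer pairs are assumed to be pre-generated, the judge is a toy substring matcher standing in for the LLM judge the framework actually uses, and only the Coverage and Consistency dimensions are shown.

```python
# Hedged sketch of QA-based cross-examination between a source text and a
# candidate text. Each text acts as a knowledge base that must answer
# questions generated from the other.

def toy_judge(qa, text):
    """Toy stand-in for an LLM judge: does the expected answer appear in the text?"""
    _question, answer = qa
    return answer.lower() in text.lower()

def cross_examine(source_qas, candidate_qas, source, candidate):
    # Coverage: fraction of source-derived facts the candidate can answer
    # (low coverage signals content omissions).
    coverage = sum(toy_judge(qa, candidate) for qa in source_qas) / len(source_qas)
    # Consistency: fraction of candidate-derived facts grounded in the source
    # (low consistency signals hallucinated or contradictory content).
    consistency = sum(toy_judge(qa, source) for qa in candidate_qas) / len(candidate_qas)
    return coverage, consistency

# Hypothetical clinical-note example: the candidate omits the date.
source = "The patient was prescribed aspirin on Monday."
candidate = "Aspirin was prescribed."
source_qas = [("What drug was prescribed?", "aspirin"),
              ("When was it prescribed?", "Monday")]
candidate_qas = [("What drug was prescribed?", "aspirin")]

print(cross_examine(source_qas, candidate_qas, source, candidate))  # (0.5, 1.0)
```

The mismatched question ("When was it prescribed?") pinpoints the omission, which is what makes the resulting scores interpretable rather than a single opaque number.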