🤖 AI Summary
This work investigates how external knowledge integration affects the explainability of commonsense-oriented natural language generation (NLG). To address the limitation of existing evaluations—overreliance on superficial metrics—we propose a three-stage explainability assessment framework and introduce KITGI, a novel benchmark that integrates ConceptNet semantic relations with human annotations to enable controlled knowledge ablation studies. Experiments on T5-Large demonstrate that full knowledge input yields 91% correctness in generated outputs, whereas removing critical knowledge drops performance to just 6%, underscoring the decisive role of external knowledge in ensuring reasoning coherence and conceptual completeness. Our key contributions are: (1) the first explainability evaluation framework specifically designed for commonsense NLG; (2) KITGI, a knowledge-sensitive benchmark enabling fine-grained diagnostic analysis; and (3) empirical evidence establishing a causal link among knowledge grounding, reasoning fidelity, and generation quality—thereby shifting evaluation paradigms from surface-level consistency toward traceable, inference-aware assessment.
📝 Abstract
This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with semantic relations retrieved from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge, and with filtered knowledge in which highly relevant relations are deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91% correctness across both criteria, while filtering drastically reduced performance to 6%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.
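Stage (1) of the method above — separating the most relevant retrieved relations from the rest before regeneration — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the triple format and the relevance heuristic (a relation counts as "highly relevant" when both of its endpoints are input concepts) are assumptions made for the example.

```python
# Hedged sketch of the knowledge-filtering stage: given an input concept
# set and retrieved (head, relation, tail) triples, split off the triples
# that directly link two input concepts. The heuristic is illustrative,
# not the paper's exact criterion.

def filter_relations(concepts, relations):
    """Return (kept, removed): 'removed' holds triples whose head and
    tail are both input concepts; 'kept' is the ablated knowledge set."""
    concept_set = set(concepts)
    kept, removed = [], []
    for head, rel, tail in relations:
        if head in concept_set and tail in concept_set:
            removed.append((head, rel, tail))  # highly relevant -> ablated
        else:
            kept.append((head, rel, tail))     # retained under filtering
    return kept, removed

# Hypothetical example input (not drawn from KITGI itself)
concepts = ["dog", "frisbee", "catch", "throw"]
relations = [
    ("dog", "CapableOf", "catch"),      # links two input concepts
    ("frisbee", "IsA", "toy"),          # only one endpoint is an input concept
    ("throw", "HasSubevent", "catch"),  # links two input concepts
]

kept, removed = filter_relations(concepts, relations)
print(kept)     # [('frisbee', 'IsA', 'toy')]
print(removed)  # [('dog', 'CapableOf', 'catch'), ('throw', 'HasSubevent', 'catch')]
```

Under the full-knowledge condition, all retrieved relations would accompany the concept set in the model input; under the filtered condition, only `kept` would, which is what the 91% vs. 6% comparison measures.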