🤖 AI Summary
Existing evaluation frameworks for generative AI–produced meeting summaries lack reusability and cross-domain applicability, hindering systematic comparison and fine-grained error analysis. This work proposes the first typology-driven, reusable evaluation pipeline specifically designed for meeting summarization, comprising five stages: data ingestion, structured reference construction, candidate generation, claim-level scoring, and reporting. By decoupling the evaluation workflow from task-specific semantics and treating references and outputs as persistent artifacts, the framework enables aggregate analysis, statistical testing, and attribution. Evaluated across diverse datasets, including municipal council meetings and White House briefings, the pipeline assessed 340 model-meeting pairs from 114 meetings. Results show that GPT-4.1-mini achieves the highest mean accuracy (0.583), although no model is a statistically significant accuracy winner, while GPT-5.1 leads in completeness (0.886) and coverage (0.942). A deployment follow-up shows GPT-5.4 outperforming GPT-4.1 on all metrics, with statistically robust gains on retention-oriented metrics.
📝 Abstract
We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing.
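A minimal sketch of the staged design described above, assuming each stage is a plain function over a dictionary payload and that artifacts are persisted as JSON files. The names `Artifact`, `persist`, `run_pipeline`, and the stage identifiers are hypothetical illustrations, not the released package's API.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical stage identifiers mirroring the five stages named in the abstract.
STAGES = (
    "source_intake",
    "reference_construction",
    "candidate_generation",
    "structured_scoring",
    "reporting",
)


@dataclass(frozen=True)
class Artifact:
    """A typed record of one stage's output for one meeting."""
    stage: str
    meeting_id: str
    payload: dict


def persist(artifact: Artifact, root: str) -> Path:
    """Write the artifact to <root>/<stage>/<meeting_id>.json and return the path."""
    path = Path(root) / artifact.stage / f"{artifact.meeting_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(artifact), indent=2))
    return path


def run_pipeline(
    meeting_id: str,
    transcript: str,
    handlers: Dict[str, Callable[[dict], dict]],
    root: str,
) -> List[Artifact]:
    """Run the stages in order; the orchestration knows nothing about task semantics."""
    payload = {"transcript": transcript}
    artifacts = []
    for stage in STAGES:
        payload = handlers[stage](payload)  # task-specific logic lives in the handler
        artifact = Artifact(stage, meeting_id, payload)
        persist(artifact, root)  # every stage output is kept for later analysis
        artifacts.append(artifact)
    return artifacts
```

Because every intermediate output is written to disk, later steps such as aggregation or significance testing can read the stored artifacts instead of rerunning models. Swapping the `handlers` mapping is the only change needed to target a different task under this sketch.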
We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1.
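The paired sign test with Holm correction mentioned above can be sketched as follows. This is an illustrative standard-library implementation, assuming a two-sided exact test with ties dropped; the function names and the toy scores are made up for the example and are not taken from the benchmark.

```python
from math import comb
from typing import List, Sequence


def sign_test_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact sign test on paired scores; tied pairs are discarded."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs
    k = min(wins, losses)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(pvalues: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on;
        # the running maximum keeps adjusted values monotone.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Toy per-meeting scores for two hypothetical models (illustrative values only).
model_a = [0.9, 0.8, 0.95, 0.7, 0.85]
model_b = [0.6, 0.7, 0.90, 0.5, 0.80]
raw_p = sign_test_p(model_a, model_b)
```

In a setup like the one described, one raw p-value would be computed per model pair and metric, and `holm` would be applied to the whole family of comparisons before declaring any difference significant.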
A typed DeepEval contrastive baseline preserves retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors captured by claim-grounded evaluation. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The benchmark covers only the offline loop; the online feedback-to-evaluation path is documented but not quantitatively evaluated.