🤖 AI Summary
Current approaches to meeting effectiveness evaluation rely on coarse-grained post-hoc questionnaires, which fail to capture the dynamic nature of collaborative interactions and suffer from limited scalability, high cost, and poor reproducibility. This work proposes a temporally fine-grained paradigm that defines meeting effectiveness as the rate of goal attainment within topical segments and introduces an end-to-end automated evaluation framework. We present a fine-grained annotation scheme tailored to topical segments, release AMI-ME, a new dataset comprising 2,459 human-annotated segments, and implement a fully automatic pipeline that uses large language models (LLMs) to map raw speech to effectiveness scores. Experiments evaluate the framework's generalization across diverse meeting types, establish strong baselines, and support further research in meeting analysis and multi-party dialogue systems.
📝 Abstract
Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. This reliance on manual assessment is costly, hard to scale, and hard to reproduce. Moreover, a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and a temporally fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment's effectiveness relative to the overall meeting objectives. Through extensive experiments, we establish a comprehensive benchmark for this new task and evaluate the framework's generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework's effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. The AMI-ME dataset and the automatic evaluation framework are available at: this URL.
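To make the segment-level definition concrete, the following is a minimal sketch of one possible reading of "rate of objective achievement over time": a judge estimates how much progress a topical segment makes toward the meeting objectives, and that progress is divided by the segment's duration. Everything here is an assumption for illustration, not the paper's implementation: the `Segment` fields, the progress scale of 0 to 1, the per-minute normalization, and the `keyword_judge` function, which is a toy stand-in for the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A topical segment of a meeting (hypothetical structure)."""
    topic: str
    start_s: float   # segment start time in seconds
    end_s: float     # segment end time in seconds
    transcript: str


# A judge maps (segment transcript, meeting objectives) to progress in [0, 1].
Judge = Callable[[str, List[str]], float]


def keyword_judge(transcript: str, objectives: List[str]) -> float:
    """Toy stand-in for an LLM judge: the fraction of objectives that
    appear verbatim in the transcript. A real system would prompt an LLM."""
    if not objectives:
        return 0.0
    text = transcript.lower()
    hits = sum(1 for objective in objectives if objective.lower() in text)
    return hits / len(objectives)


def segment_effectiveness(segment: Segment, objectives: List[str],
                          judge: Judge) -> float:
    """Assumed metric: judged progress toward the objectives per minute."""
    duration_min = (segment.end_s - segment.start_s) / 60.0
    if duration_min <= 0:
        raise ValueError("segment must have positive duration")
    progress = judge(segment.transcript, objectives)
    if not 0.0 <= progress <= 1.0:
        raise ValueError("judge must return a value in [0, 1]")
    return progress / duration_min


objectives = ["budget", "design"]
segments = [
    Segment("finance", 0.0, 120.0, "We agreed on the budget."),
    Segment("small talk", 120.0, 300.0, "How was your weekend?"),
]
scores = [segment_effectiveness(s, objectives, keyword_judge) for s in segments]
```

Scoring each segment separately, rather than the meeting as a whole, is what lets a per-segment profile show where a discussion advanced its objectives and where it stalled.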