Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation

📅 2026-04-19

🤖 AI Summary

Current approaches to meeting effectiveness evaluation rely on coarse-grained post-hoc questionnaires, which fail to capture the dynamic nature of collaborative interactions and suffer from limited scalability, high cost, and poor reproducibility. This work proposes a temporally fine-grained paradigm that defines meeting effectiveness as the rate of goal attainment within topical segments, and it introduces an end-to-end automated evaluation framework. The authors present the first fine-grained annotation scheme tailored to topical segments, release AMI-ME, a new dataset comprising 2,459 human-annotated segments from 130 AMI Corpus meetings, and implement a fully automatic pipeline that uses large language models (LLMs) to map raw speech to effectiveness scores. Experimental results show that the framework generalizes across diverse meeting types, establish baselines for the task, and support further research in meeting analysis and multi-party dialogue systems.
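The idea of effectiveness as a rate of goal attainment per topical segment can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Segment` fields, the 0-to-1 `progress` value, and the per-minute normalization are hypothetical and are not taken from the paper, which does not state its scoring formula in this summary.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One topical segment of a meeting (hypothetical structure)."""
    topic: str
    start_s: float   # segment start, in seconds
    end_s: float     # segment end, in seconds
    progress: float  # assumed: share of the meeting objective achieved here, 0..1


def effectiveness_rate(seg: Segment) -> float:
    """Return objective progress per minute for one segment."""
    minutes = (seg.end_s - seg.start_s) / 60.0
    if minutes <= 0:
        raise ValueError("segment must have positive duration")
    return seg.progress / minutes


# Illustrative values only; these are not AMI-ME annotations.
segments = [
    Segment("agenda review", 0, 300, 0.10),
    Segment("design decision", 300, 900, 0.50),
    Segment("off-topic chat", 900, 1200, 0.0),
]

rates = {s.topic: effectiveness_rate(s) for s in segments}
```

Under these assumptions, a long segment that settles a key decision can score higher than a short one that only restates the agenda, which is the kind of within-meeting variation a single post-hoc score cannot show.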

📝 Abstract

Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. This reliance on manual assessment is inherently limited in scalability, cost, and reproducibility. Moreover, a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and a temporally fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment's effectiveness relative to the overall meeting objectives. Through extensive experiments, we establish a comprehensive benchmark for this new task and evaluate the framework's generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework's effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. The AMI-ME dataset and the Automatic Evaluation Framework are available at: this URL.
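The LLM-as-a-judge step described in the abstract can be sketched roughly as follows. This is a hedged illustration, not the authors' framework: the prompt wording, the 1-to-5 scale, the `Score: <n>` reply format, and the names `build_prompt`, `parse_score`, `score_segments`, and `stub_judge` are all assumptions. The judge is passed in as a plain callable so that no particular LLM API is implied; `stub_judge` is a keyword-based stand-in for a real model call.

```python
import re
from typing import Callable, List, Optional


def build_prompt(objectives: str, segment_text: str) -> str:
    """Assemble a judge prompt from meeting objectives and one segment."""
    return (
        "Meeting objectives:\n" + objectives + "\n\n"
        "Segment transcript:\n" + segment_text + "\n\n"
        "On a scale of 1 to 5, how effectively did this segment advance "
        "the objectives? Answer with 'Score: <n>'."
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract an integer score; return None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def score_segments(
    objectives: str,
    segments: List[str],
    judge: Callable[[str], str],
) -> List[Optional[int]]:
    """Score each segment against the overall meeting objectives."""
    return [parse_score(judge(build_prompt(objectives, s))) for s in segments]


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: keyword heuristic only)."""
    segment = prompt.split("Segment transcript:\n")[1]
    return "Score: 4" if "decide" in segment else "Score: 2"


scores = score_segments(
    "Choose a casing material for the remote control.",
    ["We decide on the rubber case.", "Did anyone watch the game?"],
    stub_judge,
)
```

Injecting the judge as a callable keeps the scoring loop testable without network access, and returning `None` for unparseable replies lets a caller decide whether to retry or drop the segment rather than silently recording a wrong score.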

Problem

Research questions and friction points this paper is trying to address.

meeting effectiveness

temporal fine-grained evaluation

collaborative discussions

automatic evaluation

multi-party dialogue

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal fine-grained evaluation

meeting effectiveness

Large Language Model (LLM)

AMI-ME dataset

automatic evaluation framework
