MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can serve as intelligent agents that attend meetings on a person's behalf, addressing practical challenges including high time costs, scheduling conflicts, and low participation efficiency. To this end, the authors introduce the first LLM meeting-delegate benchmark grounded in real meeting transcripts, formally define the meeting delegate task, and propose a multi-dimensional evaluation framework covering timely intervention, key-point responsiveness, and strategic balance. The methodology integrates transcript error correction, key-point extraction, and participation-strategy modeling, and systematically evaluates GPT-4/4o, Gemini 1.5 Pro/Flash, and Llama3 variants. Results show that GPT-4/4o achieves the most balanced performance, with roughly 60% of responses addressing at least one ground-truth key point, but the models still generate redundant content and are not robust to transcription errors. Real-world demos with practitioners support the framework's practical utility, establishing a reproducible benchmark and methodological foundation for LLM-assisted meeting collaboration.

📝 Abstract
In contemporary workplaces, meetings are essential for exchanging ideas and ensuring team alignment but often face challenges such as time consumption, scheduling conflicts, and inefficient participation. Recent advancements in Large Language Models (LLMs) have demonstrated their strong capabilities in natural language generation and reasoning, prompting the question: can LLMs effectively stand in for participants in meetings? To explore this, we develop a prototype LLM-powered meeting delegate system and create a comprehensive benchmark using real meeting transcripts. Our evaluation reveals that GPT-4/4o maintain balanced performance between active and cautious engagement strategies. In contrast, Gemini 1.5 Pro tends to be more cautious, while Gemini 1.5 Flash and Llama3-8B/70B display more active tendencies. Overall, about 60% of responses address at least one key point from the ground truth. However, improvements are needed to reduce irrelevant or repetitive content and to improve tolerance of the transcription errors common in real-world settings. Additionally, we implement the system in practical settings and collect real-world feedback from demos. Our findings underscore the potential and challenges of utilizing LLMs as meeting delegates, offering valuable insights into their practical application for alleviating the burden of meetings.
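The "~60% of responses address at least one key point" result implies a coverage metric of roughly the following shape. This is a hypothetical sketch, not the paper's implementation: the function names, the 0.3 threshold, and the word-overlap similarity (a stand-in for whatever matcher or LLM judge the authors actually use) are all assumptions.

```python
# Hedged sketch of a key-point coverage metric: a delegate response counts
# as a "hit" if it matches at least one ground-truth key point.
# All names and the Jaccard-overlap similarity are illustrative assumptions.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (stand-in for a real judge)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def covers_any_key_point(response: str, key_points: list[str],
                         threshold: float = 0.3) -> bool:
    """True if the response matches at least one ground-truth key point."""
    return any(similarity(response, kp) >= threshold for kp in key_points)

def coverage_rate(responses: list[str],
                  key_point_sets: list[list[str]],
                  threshold: float = 0.3) -> float:
    """Fraction of responses that hit at least one of their key points."""
    hits = sum(covers_any_key_point(r, kps, threshold)
               for r, kps in zip(responses, key_point_sets))
    return hits / len(responses) if responses else 0.0
```

Under this reading, the headline number means `coverage_rate` over the benchmark's test turns lands near 0.6 for GPT-4/4o.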
Problem

Research questions and friction points this paper is trying to address.

Assess how effectively LLMs can act as meeting delegates
Build a benchmark from real meeting transcripts
Evaluate LLMs' engagement strategies and response accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prototype LLM-powered meeting delegate system
Benchmark built from real meeting transcripts
Evaluation of active vs. cautious engagement strategies
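The active-vs-cautious axis above can be caricatured as a turn-level decision rule: when should the delegate speak at all? This is purely an illustrative assumption; the paper's delegates are prompted LLMs reading the live transcript, not hand-written rules, and every name and cue check below is invented.

```python
# Caricature of the engagement-strategy trade-off studied in the paper.
# 'active' delegates chime in whenever a tracked topic comes up;
# 'cautious' delegates wait until their principal is addressed by name.
# All identifiers and the rule itself are hypothetical illustrations.

def should_intervene(turn: str, principal: str, topics: list[str],
                     strategy: str = "cautious") -> bool:
    """Decide whether the delegate responds to this transcript turn."""
    text = turn.lower()
    addressed = principal.lower() in text           # principal named directly
    topical = any(t.lower() in text for t in topics)  # tracked topic mentioned
    if strategy == "active":
        return addressed or topical
    return addressed  # cautious: only speak when directly addressed
```

An overly active rule produces the irrelevant or repetitive interventions the evaluation penalizes, while an overly cautious one misses key points; the benchmark rewards models that balance the two.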