MeetBench-XL: Calibrated Multi-Dimensional Evaluation and Learned Dual-Policy Agents for Real-Time Meetings

📅 2026-02-03

🤖 AI Summary

Existing meeting benchmarks struggle to capture the demands of real-world enterprise settings, such as multi-stakeholder collaboration, long-context reasoning, and tool-augmented decision-making. To address this gap, this work introduces MeetAll, a multimodal bilingual dataset comprising 231 enterprise meetings, along with a multi-dimensional evaluation protocol, MeetBench-XL, and a dual-policy agent, MeetMaster-XL. The work establishes the first evaluation system centered on four critical dimensions: cognitive load, temporal span, domain expertise, and actionable task execution. The agent uses a lightweight routing mechanism to jointly optimize the choice between fast and slow reasoning pathways and the orchestration of tools, including retrieval, cross-meeting aggregation, and web search. Experiments show that MeetMaster-XL consistently outperforms current commercial systems in factual accuracy, intent alignment, and response efficiency, and achieves a better quality-latency trade-off than single-model baselines, supported by a real-world deployment case study.

📝 Abstract

Enterprise meeting environments require AI assistants that handle diverse operational tasks, from rapid fact checking during live discussions to cross-meeting analysis for strategic planning, under strict latency, cost, and privacy constraints. Existing meeting benchmarks mainly focus on simplified question answering and fail to reflect real-world enterprise workflows, where queries arise organically from multi-stakeholder collaboration, span long temporal contexts, and require tool-augmented reasoning. We address this gap through a grounded dataset and a learned agent framework. First, we introduce MeetAll, a bilingual and multimodal corpus derived from 231 enterprise meetings totaling 140 hours. Questions are injected using an enterprise-informed protocol validated by domain-expert review and human discriminability studies. Unlike purely synthetic benchmarks, this protocol is grounded in four enterprise-critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution, calibrated through interviews with stakeholders across the finance, healthcare, and technology sectors. Second, we propose MeetBench-XL, a multi-dimensional evaluation protocol aligned with human judgment that measures factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. Third, we present MeetMaster-XL, a learned dual-policy agent that jointly optimizes query routing between fast and slow reasoning paths and tool invocation, including retrieval, cross-meeting aggregation, and web search. A lightweight classifier enables accurate routing with minimal overhead, achieving a superior quality-latency tradeoff over single-model baselines. Experiments against commercial systems show consistent gains, supported by ablations, robustness tests, and a real-world deployment case study. Resources: https://github.com/huyuelin/MeetBench.
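To make the routing idea in the abstract concrete, here is a minimal sketch of a query router that picks a fast or slow reasoning path and a set of tools. It is not the paper's method: MeetMaster-XL uses a learned lightweight classifier, while this toy version substitutes hand-written keyword cues. The cue lists, the `route` function, the `RoutingDecision` type, the scoring weights, and the threshold are all hypothetical names and values chosen for illustration. Only the tool names (retrieval, cross-meeting aggregation, web search) and the fast/slow split come from the abstract.

```python
from dataclasses import dataclass, field

# Hypothetical keyword cues standing in for a learned classifier's features.
CROSS_MEETING_CUES = ("last meeting", "previous meeting", "across meetings")
EXTERNAL_CUES = ("market", "competitor", "regulation", "latest")
PLANNING_CUES = ("compare", "plan", "why", "trade-off", "summarize")


@dataclass
class RoutingDecision:
    path: str                                  # "fast" or "slow"
    tools: list = field(default_factory=list)  # tools to invoke, in order


def route(query: str, slow_threshold: int = 2) -> RoutingDecision:
    """Toy router: score the query, then pick a path and tools.

    Assumption: any query that needs a tool goes to the slow path,
    and short factual queries stay on the fast path.
    """
    q = query.lower()
    tools = []
    score = 0
    if any(cue in q for cue in CROSS_MEETING_CUES):
        tools.append("cross_meeting_aggregation")
        score += 2
    if any(cue in q for cue in EXTERNAL_CUES):
        tools.append("web_search")
        score += 2
    if any(cue in q for cue in PLANNING_CUES):
        score += 1
    if len(q.split()) > 20:  # long queries are treated as more demanding
        score += 1
    path = "slow" if score >= slow_threshold else "fast"
    if path == "slow" and not tools:
        # Slow path with no specific tool cue falls back to plain retrieval.
        tools.append("retrieval")
    return RoutingDecision(path=path, tools=tools)


if __name__ == "__main__":
    print(route("What budget did Alice just mention?"))
    print(route("Compare the hiring plan from the previous meeting with today's proposal"))
```

In a learned version, the keyword scoring would presumably be replaced by a small classifier trained on labeled queries, with the same output shape: one path label and a list of tools.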

Problem

Research questions and friction points this paper is trying to address.

meeting benchmark

enterprise AI assistant

multi-dimensional evaluation

tool-augmented reasoning

real-time meeting

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-policy agent

multi-dimensional evaluation

tool-augmented reasoning

enterprise meeting benchmark

query routing
