MeetBench-XL: Calibrated Multi-Dimensional Evaluation and Learned Dual-Policy Agents for Real-Time Meetings

📅 2026-02-03
🤖 AI Summary
Existing meeting benchmarks struggle to capture the demands of real-world enterprise settings, such as multi-stakeholder collaboration, long-context reasoning, and tool-augmented decision-making. To address this gap, this work introduces MeetAll, a multimodal bilingual dataset built from 231 enterprise meetings, along with a multi-dimensional evaluation protocol, MeetBench-XL, and a learned dual-policy agent, MeetMaster-XL. Questions in the dataset are grounded in four enterprise-critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution. MeetBench-XL scores responses on factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. The agent uses a lightweight routing classifier to choose between fast and slow reasoning paths and to orchestrate tool usage, including retrieval, cross-meeting aggregation, and web search. Experiments show consistent gains over commercial systems and a better quality-latency trade-off than single-model baselines, supported by a real-world deployment case study.

📝 Abstract
Enterprise meeting environments require AI assistants that handle diverse operational tasks, from rapid fact checking during live discussions to cross-meeting analysis for strategic planning, under strict latency, cost, and privacy constraints. Existing meeting benchmarks mainly focus on simplified question answering and fail to reflect real-world enterprise workflows, where queries arise organically from multi-stakeholder collaboration, span long temporal contexts, and require tool-augmented reasoning. We address this gap through a grounded dataset and a learned agent framework. First, we introduce MeetAll, a bilingual and multimodal corpus derived from 231 enterprise meetings totaling 140 hours. Questions are injected using an enterprise-informed protocol validated by domain expert review and human discriminability studies. Unlike purely synthetic benchmarks, this protocol is grounded in four enterprise-critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution, calibrated through interviews with stakeholders across finance, healthcare, and technology sectors. Second, we propose MeetBench-XL, a multi-dimensional evaluation protocol aligned with human judgment that measures factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. Third, we present MeetMaster-XL, a learned dual-policy agent that jointly optimizes query routing between fast and slow reasoning paths and tool invocation, including retrieval, cross-meeting aggregation, and web search. A lightweight classifier enables accurate routing with minimal overhead, achieving a superior quality-latency tradeoff over single-model baselines. Experiments against commercial systems show consistent gains, supported by ablations, robustness tests, and a real-world deployment case study. Resources: https://github.com/huyuelin/MeetBench.
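
The routing idea in the abstract (a lightweight classifier that sends each query down a fast or a slow reasoning path and selects tools) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the cue phrases, the hand-written score, the threshold, and the names `route_query` and `RoutingDecision` are all assumptions made for illustration, and a fixed heuristic stands in for the learned classifier.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical cue phrases, invented for this sketch; the paper's
# classifier is learned, not rule-based.
CROSS_MEETING_CUES = ("last meeting", "previous meeting", "across meetings")
WEB_CUES = ("latest", "competitor", "market news")


@dataclass
class RoutingDecision:
    path: str                                  # "fast" or "slow"
    tools: List[str] = field(default_factory=list)
    score: float = 0.0                         # stand-in for a classifier probability


def route_query(query: str, threshold: float = 0.5) -> RoutingDecision:
    """Pick a reasoning path and a tool list for one meeting query."""
    text = query.lower()

    # Tool policy (assumed): cue phrases trigger the heavier tools.
    extra_tools: List[str] = []
    if any(cue in text for cue in CROSS_MEETING_CUES):
        extra_tools.append("cross_meeting_aggregation")
    if any(cue in text for cue in WEB_CUES):
        extra_tools.append("web_search")

    # Routing policy (assumed): longer queries and queries that need
    # heavier tools receive a higher "slow path" score, capped at 1.0.
    length_term = min(1.0, len(text.split()) / 20)
    score = min(1.0, 0.6 * length_term + (0.5 if extra_tools else 0.0))

    if score < threshold:
        # Fast path: answer directly, no tool calls in this sketch.
        return RoutingDecision(path="fast", tools=[], score=score)
    # Slow path: always retrieve, then add any cue-triggered tools.
    return RoutingDecision(path="slow", tools=["retrieval"] + extra_tools, score=score)


if __name__ == "__main__":
    for q in ("Who owns the budget item?",
              "Compare this with what we agreed in the last meeting"):
        decision = route_query(q)
        print(q, "->", decision.path, decision.tools)
```

In this sketch a short factual question stays on the fast path, while a query that refers to an earlier meeting is sent to the slow path with retrieval and cross-meeting aggregation enabled. In a learned system, the hand-written score would be replaced by the classifier's output.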
Problem

Research questions and friction points this paper is trying to address.

meeting benchmark
enterprise AI assistant
multi-dimensional evaluation
tool-augmented reasoning
real-time meeting
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-policy agent
multi-dimensional evaluation
tool-augmented reasoning
enterprise meeting benchmark
query routing

Yuelin Hu
Shanghai Jiao Tong University, China
Jun Xu
Shanghai Jiao Tong University, China
Bingcong Lu
Shanghai Jiao Tong University, China
Zhengxue Cheng
Assistant Researcher, Shanghai Jiao Tong University
Video and Image Coding, Computer Vision, Image Quality Assessment
Hongwei Hu
Ant Group, China
Ronghua Wu
Ant Group, China
Li Song
Professor of Electronic Engineering, Shanghai Jiao Tong University
Video Coding, Image Processing, Computer Vision