PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods for slide generation are coarse-grained and lack verifiable criteria, making it difficult to assess model performance accurately. To address this limitation, this work proposes PresentBench, the first fine-grained benchmark for slide generation grounded in real-world scenarios, comprising 238 evaluation instances with accompanying source materials. For each instance, human annotators designed an average of 54.1 binary judgment items, enabling a rubric-based, verifiable evaluation framework. Experiments show that PresentBench aligns closely with human preferences and significantly outperforms existing evaluation approaches. The benchmark also reveals that NotebookLM achieves notably strong performance on slide generation tasks.

📝 Abstract
Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment. In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks. Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.
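The abstract describes a rubric-based protocol: each instance carries a checklist of binary questions, and a generated deck is judged item by item. A minimal sketch of how such scoring could be aggregated is below; the data shapes and the pass-rate/macro-average aggregation are assumptions for illustration, not the paper's exact protocol.

```python
# Hypothetical sketch of rubric-based scoring in the style of PresentBench:
# each instance has binary checklist items, and a deck's score is the
# fraction of items judged "yes". Aggregation here is an assumption.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    question: str   # e.g. "Does the title slide state the topic?"
    passed: bool    # binary judgment for one generated deck

def instance_score(items: list[ChecklistItem]) -> float:
    """Fraction of binary checklist items the deck satisfies."""
    return sum(item.passed for item in items) / len(items)

def benchmark_score(instances: list[list[ChecklistItem]]) -> float:
    """Macro-average of per-instance scores across the benchmark."""
    return sum(instance_score(items) for items in instances) / len(instances)

items = [
    ChecklistItem("Title slide present?", True),
    ChecklistItem("All figures have captions?", False),
    ChecklistItem("Outline matches source sections?", True),
]
print(instance_score(items))  # 2 of 3 items pass
```

Because every item is a yes/no question, scores are directly verifiable and comparable across systems, which is the property the abstract contrasts with holistic judgments.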
Problem

Research questions and friction points this paper is trying to address.

slide generation
evaluation benchmark
fine-grained assessment
rubric-based evaluation
automated presentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained evaluation
rubric-based benchmark
slide generation
checklist-based assessment
automated presentation
Xin-Sheng Chen
Tsinghua University
Jiayu Zhu
Tsinghua University
Pei-lin Li
Tsinghua University
Hanzheng Wang
Staff Machine Vision Engineer, Tesla
Machine Vision, Optical Sensors
Shuojin Yang
Tsinghua University
Meng-Hao Guo
Postdoc, Tsinghua University
Foundation Models, Reasoning, Agent, Computer Vision, Computer Graphics