PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing PPT understanding benchmarks focus narrowly on isolated subtasks, overlooking the core challenge of joint visual-structural reasoning centered on layout. To address this gap, we introduce PPTBench, the first multimodal benchmark for PowerPoint layout and design understanding, constructed from 958 real-world PPTX files and comprising 4,439 samples with both visual inputs and JSON-structured annotations across four task categories: detection, understanding, modification, and generation. Through systematic evaluation of state-of-the-art multimodal large language models (MLLMs), we uncover critical deficiencies in spatial relation modeling, precise element localization, and visual-semantic alignment, which manifest as misaligned and overlapping predictions, and we demonstrate a strong correlation between layout-aware capability and API planning performance. This work fills a fundamental void in structured visual reasoning evaluation for slide-based content and establishes a new benchmark and analytical framework for joint visual-structural modeling.

📝 Abstract
PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Drawing on a diverse collection of 958 PPTX files, PPTBench evaluates models on 4,439 samples across four task categories: Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablations and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning. Case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating MLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.
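To make the abstract's notion of JSON-structured layout annotations and overlap errors concrete, here is a minimal sketch of what a layout-annotated sample and an element-overlap check might look like. The field names, coordinate convention (normalized [x1, y1, x2, y2] boxes), and sample values are illustrative assumptions, not the released PPTBench schema.

```python
import json

# Hypothetical PPTBench-style sample (assumed schema, for illustration only).
sample = json.loads("""
{
  "slide_id": "demo-001",
  "task": "detection",
  "elements": [
    {"type": "title", "bbox": [0.08, 0.05, 0.92, 0.18]},
    {"type": "image", "bbox": [0.10, 0.25, 0.48, 0.75]},
    {"type": "body_text", "bbox": [0.45, 0.25, 0.90, 0.75]}
  ]
}
""")

def boxes_overlap(a, b):
    """Axis-aligned overlap test on normalized [x1, y1, x2, y2] boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

# Flag overlapping element pairs, one of the systematic layout errors
# the paper's case studies expose.
els = sample["elements"]
overlaps = [
    (els[i]["type"], els[j]["type"])
    for i in range(len(els))
    for j in range(i + 1, len(els))
    if boxes_overlap(els[i]["bbox"], els[j]["bbox"])
]
print(overlaps)  # the image and body_text boxes overlap in this sample
```

A check like this could serve as one automatic layout-quality signal when scoring generated slides, alongside alignment and containment tests.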
Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal reasoning and layout understanding in PowerPoint tasks.
Addresses the gap between semantic understanding and visual-layout reasoning.
Highlights challenges in integrating visual cues with layout structures.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces PPTBench benchmark for PowerPoint layout evaluation
Uses 958 PPTX files across four task categories
Reveals gap between semantic understanding and layout reasoning
Zheng Huang
North Dakota State University
Human-Computer Interaction
Xukai Liu
University of Science and Technology of China
Knowledge Graphs, Natural Language Processing
Tianyu Hu
Peking University
NLP
Kai Zhang
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Ye Liu
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence