🤖 AI Summary
This work addresses the lack of reliable evaluation benchmarks for fine-grained, natural language–driven PowerPoint editing. We introduce PPTArena, the first agent-oriented benchmark for in-situ multi-element editing, covering real-world elements including text, charts, tables, animations, and slide master styles. We propose PPTPilot, a structure-aware agent that integrates semantic planning, programmatic tool invocation, and precise XML-level manipulation, operating in a closed plan-edit-verify loop to improve accuracy and visual consistency. For evaluation, we adopt a dual-channel VLM-as-judge mechanism that combines structured difference analysis with visual quality scoring. Experiments show that PPTPilot outperforms leading proprietary systems by over 10 percentage points on composite, cross-slide, and layout-sensitive tasks, with notable gains in visual fidelity and document-level consistency. The results also show that long-horizon editing remains a critical open challenge.
📝 Abstract
We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
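The plan-edit-check loop described above can be pictured with a minimal sketch. This is not PPTPilot's implementation: the `plan`, `apply_edit`, `verify`, and `plan_edit_check` functions, the hard-coded plan, and the simplified slide fragment are all assumptions made for illustration. Only the DrawingML namespace URI and the `<a:p>/<a:r>/<a:t>` text-run structure are taken from the Office Open XML format.

```python
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs in PowerPoint slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Hypothetical, heavily simplified slide fragment (not a complete slide part).
SLIDE_XML = (
    f'<sld xmlns:a="{A_NS}">'
    "<a:p><a:r><a:t>Q3 Revenue</a:t></a:r></a:p>"
    "<a:p><a:r><a:t>Draft figures</a:t></a:r></a:p>"
    "</sld>"
)


def plan(instruction):
    """Hypothetical planner: maps an instruction to (old, new) text edits.

    A real agent would query a model here; this stub is hard-coded.
    """
    return [("Draft figures", "Final figures")]


def apply_edit(root, old, new):
    """Deterministic XML operation: replace exact text in <a:t> runs."""
    changed = 0
    for t in root.iter(f"{{{A_NS}}}t"):
        if t.text == old:
            t.text = new
            changed += 1
    return changed


def verify(root, edits):
    """Check the task constraints: new text present, old text absent."""
    texts = [t.text for t in root.iter(f"{{{A_NS}}}t")]
    return all(new in texts and old not in texts for old, new in edits)


def plan_edit_check(xml_text, instruction, max_rounds=3):
    """Repeat plan -> edit -> check until the constraints hold or rounds run out."""
    root = ET.fromstring(xml_text)
    for _ in range(max_rounds):
        edits = plan(instruction)
        for old, new in edits:
            apply_edit(root, old, new)
        if verify(root, edits):
            return ET.tostring(root, encoding="unicode"), True
    return ET.tostring(root, encoding="unicode"), False


if __name__ == "__main__":
    result, ok = plan_edit_check(SLIDE_XML, "Mark the figures as final")
    print(ok)
```

In this sketch the check step only inspects the XML structure. The paper's pipeline additionally scores rendered slide images, which is outside what a stdlib-only example can show.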