🤖 AI Summary
Existing automated academic presentation systems suffer from narrative discontinuity, suboptimal visual aesthetics, and absence of self-improvement capabilities. This paper introduces PresAesth—the first multi-task reinforcement learning framework for aesthetic modeling in academic presentations—capable of aesthetic scoring, defect correction, and iterative self-refinement driven by comparative feedback, even under data scarcity. We propose EvoPresent Benchmark, the first evaluation benchmark jointly quantifying content quality and aesthetic perception. Furthermore, we design an end-to-end slide generation pipeline integrating narrative generation, virtual presenter embodiment, and closed-loop aesthetic optimization. Experiments demonstrate that high-quality comparative feedback significantly enhances presentation quality, uncovering fundamental trade-offs between content fidelity and visual design. Multi-task RL exhibits superior generalizability in aesthetic modeling compared to single-task alternatives.
📝 Abstract
The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of these challenges is a simple principle: *you cannot improve what you cannot evaluate correctly*. To address this, we introduce **EvoPresent**, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is **PresAesth**, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even with limited aesthetic training data. To systematically evaluate these methods, we introduce the **EvoPresent Benchmark**, a comprehensive benchmark comprising: *Presentation Generation Quality*, built on 650 top-tier AI conference papers with multimodal resources (slides, videos, and scripts) to assess both content and design; and *Aesthetic Awareness*, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that: (i) high-quality feedback is essential for agent self-improvement, and strong initial capability alone does not guarantee effective self-correction; (ii) automated generation pipelines exhibit a trade-off between visual design and content construction; and (iii) multi-task RL training shows stronger generalization in aesthetic awareness tasks.
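The closed-loop idea behind the abstract — score a slide, propose a defect adjustment, and keep the change only if comparative feedback confirms an improvement — can be sketched in a few lines. This is purely illustrative: `Slide`, `score`, `adjust`, and `prefer` are hypothetical stand-ins for the paper's components, not the actual PresAesth model or API.

```python
from dataclasses import dataclass


@dataclass
class Slide:
    clutter: float  # toy "defect" measure; lower is better


def score(slide: Slide) -> float:
    """Stand-in aesthetic scorer: penalize clutter."""
    return 10.0 - slide.clutter


def adjust(slide: Slide) -> Slide:
    """Stand-in defect adjustment: reduce clutter by one step."""
    return Slide(clutter=max(0.0, slide.clutter - 1.0))


def prefer(candidate: Slide, current: Slide) -> Slide:
    """Stand-in comparative feedback: keep the better-scoring slide."""
    return candidate if score(candidate) >= score(current) else current


def self_improve(slide: Slide, rounds: int = 3) -> Slide:
    # Iterative self-refinement: propose an adjustment each round and
    # accept it only if the comparison signal says it is an improvement.
    for _ in range(rounds):
        slide = prefer(adjust(slide), slide)
    return slide


final = self_improve(Slide(clutter=4.0))
print(round(score(final), 1))  # clutter 4.0 -> 1.0 after 3 rounds, score 9.0
```

The design point the sketch captures is that the comparison step gates every adjustment, so refinement never degrades the slide even when the adjustment heuristic is imperfect.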