🤖 AI Summary
Current multimodal large language models (MLLMs) show limited ability to generate Video Comment Art (creative comments that convey humor, satire, or emotional resonance), and dedicated multimodal benchmarks for evaluating it are lacking. To address this, we introduce GODBench, a novel multimodal benchmark designed specifically for evaluating Video Comment Art, covering both video and text modalities and diverse creative dimensions. We further propose Ripple of Thought (RoT), a multi-step reasoning paradigm inspired by physical wave propagation and integrated with structured prompting and cross-modal alignment to enhance generative creativity. Experiments show that state-of-the-art MLLMs perform poorly on GODBench, while RoT consistently improves creative quality, by 23.6% on average. Both the benchmark and the methodology are open-sourced to advance research in video-based creative generation.
📝 Abstract
Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, and composing it requires a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) prompting have demonstrated strong reasoning abilities in STEM tasks (e.g., mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by limited modalities and insufficient categories, which hinders the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improving creative composition, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at https://github.com/stan-lei/GODBench-ACL2025.