GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) show limited ability to generate Video Comment Art, that is, comments conveying humor, irony, or emotional resonance, and the field lacks a dedicated multimodal evaluation benchmark. To address this, we introduce GODBench, the first multimodal benchmark explicitly designed for evaluating Video Comment Art, covering both video and text modalities and diverse creative dimensions. We further propose Ripple of Thought (RoT), a multi-step reasoning paradigm inspired by wave propagation in physics and integrated with structured prompting and cross-modal alignment to enhance generative creativity. Experiments demonstrate that state-of-the-art MLLMs perform poorly on GODBench; RoT consistently improves creative quality by 23.6% on average. Both the benchmark and methodology are open-sourced to advance research in video-based creative generation.

📝 Abstract
Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at https://github.com/stan-lei/GODBench-ACL2025.

Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to generate creative video comments
Addressing limited modalities in existing Comment Art benchmarks
Enhancing MLLM creativity via the Ripple of Thought (RoT) framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces GODBench for video-text multimodal evaluation
Proposes Ripple of Thought for creative reasoning
Enhances MLLM creativity in video comment art

Authors

Chenkai Zhang
Beihang University; Hangzhou Innovation Institute, Beihang University
Yiming Lei
Beihang University; Hangzhou Innovation Institute, Beihang University
Zeming Liu
Beihang University
Haitao Leng
Kuaishou Technology
Shaoguo Liu
Alibaba Corporation
Machine Learning, Computer Vision
Tingting Gao
Kuaishou Technology
Qingjie Liu
Professor, School of Computer Science and Engineering, Beihang University
Computer Vision and Pattern Recognition
Yunhong Wang
Professor, School of Computer Science and Engineering, Beihang University
Biometrics, Pattern Recognition, Image Processing, Computer Vision