GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current multimodal large language models (MLLMs) show limited ability to generate Video Comment Art (comments that convey humor, irony, or emotional resonance), and dedicated multimodal benchmarks for evaluating it are lacking. To address this, we introduce GODBench, the first multimodal benchmark explicitly designed for evaluating Video Comment Art, covering both video and text modalities and diverse creative dimensions. We further propose Ripple of Thought (RoT), a multi-step reasoning paradigm inspired by physical wave propagation and integrated with structured prompting and cross-modal alignment to enhance generative creativity. Experiments show that state-of-the-art MLLMs perform poorly on GODBench, whereas RoT improves creative quality by 23.6% on average. Both the benchmark and the methodology are open-sourced to advance research in video-based creative generation.

📝 Abstract

Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at https://github.com/stan-lei/GODBench-ACL2025.

Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to generate creative video comments

Addressing limited modalities in existing Comment Art benchmarks

Enhancing MLLM creativity via the Ripple of Thought framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces GODBench for video-text multimodal evaluation

Proposes Ripple of Thought for creative reasoning

Enhances MLLM creativity in video comment art
