Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

📅 2025-03-18

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study addresses the lack of systematic evaluation of creative generation in multimodal large language models (MLLMs) on real-world, image-based tasks. It introduces Creation-MMBench, a multimodal benchmark for assessing MLLM creativity, comprising 765 test cases across 51 fine-grained tasks. Methodologically, each test case carries instance-specific evaluation criteria that guide the assessment of both general response quality and factual consistency with the visual input. Key findings and contributions include: (1) evidence that visual fine-tuning can degrade the creative abilities of the base LLM; (2) publicly released data and evaluation code; and (3) experimental results showing that current open-source MLLMs significantly underperform proprietary models on creative tasks. Creation-MMBench is publicly available to support measurable progress in multimodal generative intelligence.


📝 Abstract

Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at https://github.com/open-compass/Creation-MMBench.
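To make the evaluation setup in the abstract more concrete, the following is a minimal sketch of how a test case with instance-specific criteria and its two scores might be represented and aggregated per task. It is not the benchmark's actual code: the class names, field names, task labels, and numeric score scale are all hypothetical, chosen only for illustration. The one figure taken from the source is the benchmark size, 765 test cases over 51 tasks, which works out to an average of 15 cases per task.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TestCase:
    """One benchmark instance (hypothetical schema, not the released format)."""
    task: str              # fine-grained task name (illustrative label)
    image_path: str        # visual input the response must stay consistent with
    prompt: str            # creative instruction given to the model
    quality_criteria: str  # instance-specific criteria for general response quality
    visual_criteria: str   # instance-specific criteria for consistency with the image


@dataclass
class Judgement:
    """Scores assigned to one model response (score scale is an assumption)."""
    task: str
    quality_score: float   # general response quality
    visual_score: float    # factual consistency with the visual input


def aggregate_by_task(judgements):
    """Average both scores separately for every task."""
    buckets = defaultdict(list)
    for judgement in judgements:
        buckets[judgement.task].append(judgement)
    return {
        task: {
            "quality": mean(j.quality_score for j in items),
            "visual": mean(j.visual_score for j in items),
        }
        for task, items in buckets.items()
    }


# Average number of test cases per task implied by the reported totals.
CASES_PER_TASK = 765 / 51
```

Keeping the two scores separate, rather than collapsing them into one number, mirrors the abstract's distinction between general response quality and factual consistency with visual inputs.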

Problem

Research questions and friction points this paper is trying to address.

Assessing creative intelligence in Multimodal Large Language Models (MLLMs).

Evaluating MLLMs in real-world, image-based creative tasks.

Analyzing the impact of visual fine-tuning on the base LLM's creativity.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Creation-MMBench evaluates MLLM creativity

765 test cases across 51 tasks

Visual fine-tuning can negatively impact the base LLM's creativity

🔎 Similar Papers

Divergent Creativity in Humans and Large Language Models