🤖 AI Summary
Despite growing interest in native multimodal generation, the image-generation capabilities and limitations of large language models (LLMs) such as GPT-4o remain poorly characterized, particularly across spatial, temporal, commonsense, and knowledge-intensive tasks.
Method: We conduct a systematic evaluation using a multimodal image-generation benchmark that spans generative and discriminative tasks across six capability dimensions, pairing carefully curated test samples with qualitative analysis.
Contribution/Results: The framework jointly assesses spatial reasoning, temporal consistency, commonsense grounding, and domain-specific knowledge (e.g., scientific illustration). We find that while GPT-4o excels at text-to-image synthesis, style transfer, and low-level image processing, it consistently exhibits hallucinations, factual inaccuracies, and structural inconsistencies in tasks that demand rigorous real-world modeling or logical constraints, especially precise spatial layout, instruction alignment, and sequential coherence. This work characterizes the capability boundaries of unified multimodal LLMs for image generation, providing a benchmark and research directions toward controllable, trustworthy multimodal synthesis.
📝 Abstract
Recently, OpenAI unlocked the visual generation ability of GPT-4o(mni). It demonstrates remarkable generation quality, with strong understanding of multimodal conditions and varied task instructions. In this paper, we aim to explore the capabilities of GPT-4o across various tasks. Inspired by previous studies, we construct a task taxonomy along with a carefully curated set of test samples to conduct a comprehensive qualitative evaluation. Benefiting from GPT-4o's powerful multimodal comprehension, its image-generation process extends beyond the scope of traditional image-generation tasks. Accordingly, we evaluate its performance across six task categories: traditional image generation, discriminative tasks, knowledge-based generation, commonsense-based generation, spatially-aware image generation, and temporally-aware image generation. These tasks not only assess the quality and conditional alignment of the model's outputs but also probe GPT-4o's deeper understanding of real-world concepts. Our results reveal that GPT-4o performs impressively well on general-purpose synthesis tasks, showing strong capabilities in text-to-image generation, visual stylization, and low-level image processing. However, significant limitations remain in precise spatial reasoning, instruction-grounded generation, and consistent temporal prediction. Furthermore, in knowledge-intensive or domain-specific scenarios, such as scientific illustrations or mathematical plots, the model often exhibits hallucinations, factual errors, or structural inconsistencies. These findings suggest that while GPT-4o marks a substantial advancement in unified multimodal generation, there is still a long way to go before it can be reliably applied in professional or safety-critical domains.