🤖 AI Summary
Prior work lacks systematic evaluation of unified multimodal generation capabilities in state-of-the-art large multimodal models (LMMs) across diverse cross-modal image generation tasks (e.g., text-to-image, image-to-image, image-to-3D, image-to-X).
Method: This study conducts the first comprehensive benchmark of GPT-4o across 20+ cross-modal generation tasks, employing multi-task prompt engineering, cross-modal consistency assessment, and a hybrid human–automated evaluation framework integrating quantitative metrics and qualitative analysis.
Contribution/Results: GPT-4o demonstrates superior text–image co-generation performance versus mainstream multimodal models, reflecting strong semantic understanding and cross-modal alignment. However, it lags significantly behind specialized diffusion models in fine-grained spatial control, 3D geometric fidelity, and complex image editing—positioning its overall generative capability between domain-specific models and earlier LMMs. The study empirically identifies data scale and architectural design as critical determinants of generation quality under unified architectures, establishing the first evidence-based benchmark and methodological paradigm for evaluating generative capabilities of multimodal foundation models.
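The hybrid human–automated evaluation mentioned above can be sketched as a weighted combination of normalized automated metric scores and human ratings. The function name, the 1–5 rating scale, and the 50/50 weighting below are illustrative assumptions for exposition, not the paper's actual protocol.

```python
def hybrid_score(auto_metrics, human_ratings, w_auto=0.5):
    """Combine automated metric scores (assumed already normalized to
    [0, 1]) with human ratings (assumed 1-5 Likert scale) into a single
    per-task score in [0, 1]. Weighting is an illustrative assumption."""
    auto = sum(auto_metrics) / len(auto_metrics)
    # Rescale 1-5 ratings to [0, 1] before averaging.
    human = sum((r - 1) / 4 for r in human_ratings) / len(human_ratings)
    return w_auto * auto + (1 - w_auto) * human

# Hypothetical task: two automated metrics, three human raters.
score = hybrid_score([0.82, 0.74], [4, 5, 4])  # ≈ 0.807
```

In practice the automated component would be a task-appropriate metric (e.g., a text–image similarity score for text-to-image tasks), and the weighting would be calibrated against human judgments rather than fixed.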
📝 Abstract
The landscape of image generation has evolved rapidly, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Although recent advances, most notably GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, their architectural designs remain undisclosed. This raises the question of whether image and text generation have truly been integrated into a unified framework in these methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories (text-to-image, image-to-image, image-to-3D, and image-to-X generation), spanning more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the roles of architectural design and data scaling.