🤖 AI Summary
This study challenges the “generation-as-understanding” assumption underlying GPT-4o’s world-knowledge-guided semantic synthesis. Method: We systematically evaluate its unified competence across three dimensions—global instruction adherence, fine-grained editing fidelity, and post-generation reasoning—using a multidimensional, human-curated benchmark. Our assessment integrates instruction robustness analysis, knowledge-constraint consistency verification, and conditional reasoning validation. Contribution/Results: We provide the first empirical evidence of GPT-4o’s fundamental deficiency in dynamic knowledge integration: it consistently exhibits literal misinterpretation, inconsistent domain-knowledge application, and failures in conditional reasoning. These findings indicate that current multimodal large language models lack a truly closed-loop generation–understanding mechanism, thereby challenging prevailing assumptions about their cross-modal semantic coherence and knowledge-driven compositional capabilities.
📝 Abstract
OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world-knowledge-informed semantic synthesis (seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence) remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o's strengths in image generation and editing, our evaluation reveals persistent limitations: the model frequently defaults to literal interpretations of instructions, applies knowledge constraints inconsistently, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o's unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware, reasoning-grounded multimodal generation.