🤖 AI Summary
This work addresses a critical gap in current unified multimodal models, which exhibit significant deficiencies in leveraging world knowledge across diverse tasks, while existing evaluation benchmarks remain limited to isolated single-task assessments and lack diagnostic capability. To this end, we propose AEGIS—the first multidimensional unified evaluation framework specifically designed to assess world knowledge proficiency. AEGIS encompasses four core tasks—visual understanding, generation, editing, and interleaved generation—and includes 1,050 human-annotated questions spanning 21 themes and six reasoning types. We further introduce a Deterministic Checklist Evaluation (DCE) protocol based on atomic yes/no judgments to enhance assessment reliability. Experiments reveal substantial performance degradation in state-of-the-art models under complex reasoning scenarios, though integrating simple reasoning modules partially mitigates this issue, offering actionable insights for future model improvement.
📝 Abstract
The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (**A**ssessing **E**diting, **G**eneration, **I**nterpretation-Understanding for **S**uper-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (including STEM, the humanities, and daily life) and 6 reasoning types. To evaluate the world-knowledge proficiency of UMMs without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces subjective prompt-based scoring with atomic "Y/N" judgments to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world-knowledge deficits and that their performance degrades significantly as reasoning complexity increases. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results establish world-knowledge-based reasoning as a critical frontier for UMMs.
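To make the DCE idea concrete, here is a minimal sketch of checklist-based scoring under an assumed data shape: each question carries a list of atomic checks, and each check receives a deterministic "Y"/"N" judgment. The function name and structure below are illustrative, not taken from the paper's released code.

```python
def dce_score(judgments: list[str]) -> float:
    """Score one question as the fraction of atomic checks judged 'Y'.

    Each element of `judgments` is the verdict for one atomic yes/no
    checklist item; the aggregate is a deterministic ratio rather than
    a free-form prompt-based score.
    """
    if not judgments:
        raise ValueError("checklist must contain at least one atomic check")
    if any(j not in ("Y", "N") for j in judgments):
        raise ValueError("each judgment must be exactly 'Y' or 'N'")
    return sum(j == "Y" for j in judgments) / len(judgments)

# Example: 3 of 4 atomic checks pass.
print(dce_score(["Y", "Y", "N", "Y"]))  # 0.75
```

Because every verdict is binary, two runs over the same judgments always produce the same score, which is the reliability property the protocol targets.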