🤖 AI Summary
This work addresses a critical gap in current unified multimodal models, which exhibit significant deficiencies in leveraging world knowledge across diverse tasks, while existing evaluation benchmarks remain limited to isolated single-task assessments and lack diagnostic capability. To this end, we propose AEGIS—the first multidimensional unified evaluation framework specifically designed to assess world knowledge proficiency. AEGIS encompasses four core tasks—visual understanding, generation, editing, and interleaved generation—and includes 1,050 human-annotated questions spanning 21 themes and six reasoning types. We further introduce a Deterministic Checklist Evaluation (DCE) protocol based on atomic yes/no judgments to enhance assessment reliability. Experiments reveal substantial performance degradation in state-of-the-art models under complex reasoning scenarios, though integrating simple reasoning modules partially mitigates this issue, offering actionable insights for future model improvement.
📝 Abstract
The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (**A**ssessing **E**diting, **G**eneration, **I**nterpretation-Understanding for **S**uper-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (including STEM, the humanities, and daily life) and 6 reasoning types. To evaluate the world-knowledge proficiency of UMMs without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces subjective prompt-based scoring with atomic "Y/N" judgments to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world-knowledge deficits and that their performance degrades significantly as reasoning complexity increases. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results establish world-knowledge-based reasoning as a critical frontier for UMMs.
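To make the DCE idea concrete, here is a minimal sketch of checklist-based scoring under an assumed data shape: each question carries a list of atomic checks, and each check receives a deterministic "Y"/"N" judgment. The function name and structure below are illustrative, not taken from the paper's released code.

```python
def dce_score(judgments: list[str]) -> float:
    """Score one question as the fraction of atomic checks judged 'Y'.

    Each element of `judgments` is the verdict for one atomic yes/no
    checklist item; the aggregate is a deterministic ratio rather than
    a free-form prompt-based score.
    """
    if not judgments:
        raise ValueError("checklist must contain at least one atomic check")
    if any(j not in ("Y", "N") for j in judgments):
        raise ValueError("each judgment must be exactly 'Y' or 'N'")
    return sum(j == "Y" for j in judgments) / len(judgments)

# Example: 3 of 4 atomic checks pass.
print(dce_score(["Y", "Y", "N", "Y"]))  # 0.75
```

Because every verdict is binary, two runs over the same judgments always produce the same score, which is the reliability property the protocol targets.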