AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in current unified multimodal models, which exhibit significant deficiencies in leveraging world knowledge across diverse tasks, while existing evaluation benchmarks remain limited to isolated single-task assessments and lack diagnostic capability. To this end, we propose AEGIS, the first multidimensional unified evaluation framework specifically designed to assess world knowledge proficiency. AEGIS encompasses four core tasks (visual understanding, generation, editing, and interleaved generation) and includes 1,050 human-annotated questions spanning 21 themes and six reasoning types. We further introduce a Deterministic Checklist-based Evaluation (DCE) protocol built on atomic yes/no judgments to enhance assessment reliability. Experiments reveal substantial performance degradation in state-of-the-art models under complex reasoning scenarios, though integrating simple reasoning modules partially mitigates this issue, offering actionable insights for future model improvement.
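For concreteness, a single AEGIS item could be represented roughly as below. The field names and values are hypothetical, inferred only from the summary (four task types, 21 themes, six reasoning types, per-item yes/no checklists), not the paper's actual data schema.

```python
# Hypothetical sketch of one AEGIS benchmark item; field names are illustrative,
# not the paper's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AegisItem:
    question: str            # one of the 1,050 human-annotated questions
    task: str                # "understanding", "generation", "editing", or "interleaved"
    theme: str               # one of the 21 themes, e.g. "STEM", "humanities", "daily life"
    reasoning_type: str      # one of the six reasoning types
    checklist: List[str] = field(default_factory=list)  # atomic yes/no criteria used by DCE
```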

📝 Abstract
The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (including STEM, humanities, and daily life) and 6 reasoning types. To concretely evaluate UMMs' command of world knowledge without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results underscore world-knowledge-based reasoning as a critical frontier for UMMs.
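As a minimal sketch of how DCE-style scoring could be aggregated, one option is to take each item's score as the fraction of its atomic "Y/N" checks that pass. The paper's exact aggregation rule is not stated in this summary, and the item ids below are made up.

```python
from typing import Dict, List

def dce_score(judgments: Dict[str, List[bool]]) -> Dict[str, float]:
    """Aggregate atomic Y/N checklist judgments into a per-item score.

    `judgments` maps an item id to its list of per-criterion verdicts
    (True = the model output satisfies that atomic check). Each item's
    score is taken here as the fraction of checks passed; the benchmark's
    actual aggregation may differ.
    """
    return {
        item_id: sum(checks) / len(checks) if checks else 0.0
        for item_id, checks in judgments.items()
    }

# Example with hypothetical item ids and verdicts:
print(dce_score({"stem_0007": [True, True, False], "daily_0012": [True, True]}))
# {'stem_0007': 0.6666666666666666, 'daily_0012': 1.0}
```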
Problem

Research questions and friction points this paper is trying to address.

Unified Multimodal Models
world knowledge
multimodal benchmark
knowledge-based reasoning
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Models
World Knowledge
Multi-task Benchmark
Deterministic Checklist-based Evaluation
Complex Reasoning
Jintao Lin
University of Hong Kong
Bowen Dong
The Hong Kong Polytechnic University
Weikang Shi
The Chinese University of Hong Kong
Chenyang Lei
Princeton University
Computational Photography, Computer Vision, Generative AI
Suiyun Zhang
Huawei Research
Rui Liu
Huawei Research
Xihui Liu
University of Hong Kong, UC Berkeley, CUHK, Tsinghua University
Computer Vision, Deep Learning