M^2ConceptBase: A Fine-Grained Aligned Concept-Centric Multimodal Knowledge Base

📅 2023-12-16
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing multimodal knowledge bases exhibit coarse-grained alignment between images and textual descriptions, hindering multimodal models' understanding of fine-grained visual concepts. Method: We introduce M^2ConceptBase, the first concept-centric multimodal knowledge base, comprising 152K fine-grained concepts precisely aligned with 951K images and detailed textual descriptions, enabling concept-level cross-modal semantic grounding. We propose a concept-centric modeling paradigm and design a context-aware multimodal symbol grounding method that achieves over 95% alignment accuracy among concepts, images, and descriptions; we further establish a joint construction and human verification framework. Contribution/Results: Experiments demonstrate that M^2ConceptBase significantly improves VQA model performance on OK-VQA and, via retrieval augmentation, effectively enhances multimodal large language models' fine-grained concept understanding.
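
To picture the grounding step concretely: a minimal sketch, assuming a generic text encoder, of how context-aware concept-image alignment might work. It scores each candidate image for a concept by comparing the concept plus its description against the caption context the image co-occurred with, keeping matches above a threshold. The `embed` stub, the 0.3 threshold, and all identifiers are illustrative assumptions, not the paper's released implementation.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical text encoder (stand-in for a real CLIP/BERT-style model)."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).standard_normal(512)
    return v / np.linalg.norm(v)

def ground_concept(concept: str, description: str, candidates: list[dict],
                   threshold: float = 0.3) -> list[dict]:
    """Score candidate images for a concept using their caption context.

    Each candidate is {"image_id": ..., "caption": ...}, where the caption is
    the context the image appeared with in the source image-text dataset.
    """
    query = embed(f"{concept}: {description}")
    aligned = []
    for cand in candidates:
        # Context-aware score: concept + description vs. the image's caption.
        score = float(query @ embed(cand["caption"]))
        if score >= threshold:
            aligned.append({**cand, "score": score})
    return sorted(aligned, key=lambda c: c["score"], reverse=True)
```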
📝 Abstract
Multimodal knowledge bases (MMKBs) provide cross-modal aligned knowledge crucial for multimodal tasks. However, the images in existing MMKBs are generally collected for entities in encyclopedia knowledge graphs. Therefore, detailed groundings of visual semantics with linguistic concepts are lacking, which are essential for the visual concept cognition ability of multimodal models. Addressing this gap, we introduce M^2ConceptBase, the first concept-centric MMKB. M^2ConceptBase models concepts as nodes with associated images and detailed textual descriptions. We propose a context-aware multimodal symbol grounding approach to align concept-image and concept-description pairs using context information from image-text datasets. Comprising 951K images and 152K concepts, M^2ConceptBase links each concept to an average of 6.27 images and a single description, ensuring comprehensive visual and textual semantics. Human studies confirm more than 95% alignment accuracy, underscoring its quality. Additionally, our experiments demonstrate that M^2ConceptBase significantly enhances VQA model performance on the OK-VQA task. M^2ConceptBase also substantially improves the fine-grained concept understanding capabilities of multimodal large language models through retrieval augmentation in two concept-related tasks, highlighting its value.
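
As a rough illustration of the data model the abstract describes (each concept node links a set of images and a single description) and of how such a record could serve retrieval augmentation for a VQA prompt, consider the sketch below. The `ConceptNode` layout, the toy entry, and the prompt format are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptNode:
    """A concept-centric record: one concept, its aligned images, one description."""
    concept: str
    description: str                                     # single detailed description
    image_ids: list[str] = field(default_factory=list)   # ~6.27 images per concept on average

# Toy knowledge base keyed by concept surface form (hypothetical entry).
kb: dict[str, ConceptNode] = {
    "husky": ConceptNode(
        concept="husky",
        description="A sled-dog breed with erect triangular ears and a thick double coat.",
        image_ids=["img_001", "img_002"],
    ),
}

def augment_question(question: str, mentioned_concepts: list[str]) -> str:
    """Retrieval augmentation: prepend retrieved concept descriptions to the prompt."""
    facts = [kb[c].description for c in mentioned_concepts if c in kb]
    if not facts:
        return question
    return f"Context: {' '.join(facts)}\nQuestion: {question}"

print(augment_question("What is this dog bred for?", ["husky"]))
```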
Problem

Research questions and friction points this paper is trying to address.

Knowledge Base
Image-Text Alignment
Multimodal Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

M^2ConceptBase
Concept-centric Knowledge Base
Multimodal Integration
Zhiwei Zha
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
Jiaan Wang
WeChat AI, Tencent
Natural Language Processing, Machine Translation, Information Systems
Zhixu Li
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
Xiangru Zhu
Fudan University
cross-modal alignment, multi-modal understanding, multi-modal generation
Wei Song
Research Center for Intelligent Robotics, Zhejiang Lab, Hangzhou, China
Yanghua Xiao
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China