🤖 AI Summary
This work addresses a key limitation of existing large-scale multimodal commonsense knowledge graphs: their limited support for complex contextual reasoning tasks such as image captioning and story generation. For the first time, it systematically integrates visual information into the ATOMIC2020 commonsense knowledge graph, using an efficient image retrieval process to construct over 900,000 multimodal triples that jointly encode physical, social, and eventive commonsense knowledge. This enriched resource enables joint textual and visual commonsense reasoning, significantly improving the richness, coherence, and contextual relevance of generated visual stories. The approach overcomes the inherent constraints of purely text-based commonsense methods in multimodal settings, offering a more comprehensive foundation for downstream vision-language applications.
📝 Abstract
We present MMCOMET, the first multimodal commonsense knowledge graph (MMKG) that integrates physical, social, and eventive knowledge. MMCOMET extends the ATOMIC2020 knowledge graph with a visual dimension through an efficient image retrieval process, resulting in over 900K multimodal triples. This new resource addresses a major limitation of existing MMKGs in supporting complex reasoning tasks like image captioning and storytelling. Through a standard visual storytelling experiment, we show that our holistic approach enables the generation of richer, more coherent, and more contextually grounded stories than those produced using text-only knowledge. This resource establishes a new foundation for multimodal commonsense reasoning and narrative generation.
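To make the image retrieval step concrete, here is a minimal sketch of how textual ATOMIC-style triples might be grounded with images via joint text-image embeddings. The model name, verbalization scheme, and file paths are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: attach retrieved images to ATOMIC-style triples using a
# CLIP-based text/image encoder. Not the paper's implementation.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # joint text-image embedding space

# A few ATOMIC-style (head, relation, tail) triples, verbalized into sentences.
triples = [
    ("PersonX bakes bread", "xEffect", "PersonX gets flour on their hands"),
    ("PersonX rides a bike", "xIntent", "to get exercise"),
]
sentences = [f"{head}. As a result, {tail}." for head, _, tail in triples]

# Candidate image pool (paths are placeholders).
image_paths = ["images/baking.jpg", "images/cycling.jpg", "images/park.jpg"]
image_embs = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)
text_embs = model.encode(sentences, convert_to_tensor=True)

# For each triple, keep the top-k most similar images as its visual grounding,
# yielding (textual triple, retrieved images) pairs, i.e. multimodal triples.
hits = util.semantic_search(text_embs, image_embs, top_k=2)
multimodal_triples = [
    (triples[i], [image_paths[h["corpus_id"]] for h in hit])
    for i, hit in enumerate(hits)
]
print(multimodal_triples)
```

At the scale reported in the abstract (over 900K triples), a brute-force similarity search like the one above would presumably be replaced by an approximate nearest-neighbor index, but the overall retrieval idea is the same.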