🤖 AI Summary
While current multimodal large language models (MLLMs) excel at object recognition, they lack human-like understanding of physical and social regularities grounded in visual perception.
Method: This work introduces the concept of “visual knowledge” and establishes VKnowU, the first video-based benchmark (1,680 questions) jointly oriented toward world-centric and human-centric knowledge, alongside the See-Think-Answer reasoning paradigm. The authors further propose a visual-knowledge-guided reward function and train VideoKnow+, leveraging the multimodal visual question-answering dataset VKnowQA.
Contribution/Results: Experiments show that VideoKnow+ achieves a +3.7% gain on VKnowU and delivers consistent improvements across general-purpose video-understanding benchmarks—including MVBench, Video-MME, and MMVU—significantly narrowing the gap between model and human performance in visual knowledge comprehension. This is the first systematic effort to both evaluate and enhance the deep visual cognitive capabilities of MLLMs.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions across 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) domains. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
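The abstract describes a See-Think-Answer output format trained with a visual knowledge reward, but does not give the reward's exact form. The sketch below illustrates one plausible rule-based reward of this kind: a format term for producing the structured output, plus an accuracy term for the final answer. The tag names (`<see>`, `<think>`, `<answer>`) and the 0.5/0.5 weighting are assumptions for illustration, not the authors' specification.

```python
import re

# Assumed response structure: <see>...</see><think>...</think><answer>...</answer>
# (hypothetical tags; the paper's actual schema may differ)
SEE_THINK_ANSWER = re.compile(
    r"<see>(.*?)</see>\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def visual_knowledge_reward(response: str, gold_answer: str) -> float:
    """Toy reward combining a format term (response follows the
    See-Think-Answer structure) with an accuracy term (the <answer>
    span matches the reference). Weights are illustrative assumptions."""
    match = SEE_THINK_ANSWER.search(response)
    if match is None:
        return 0.0  # malformed output earns no reward
    format_reward = 0.5
    answer = match.group(3).strip().lower()
    accuracy_reward = 0.5 if answer == gold_answer.strip().lower() else 0.0
    return format_reward + accuracy_reward
```

In an RL fine-tuning loop (e.g., GRPO- or PPO-style), a scalar reward of this shape would be computed per sampled response and used to update the policy; the actual VideoKnow+ reward may additionally score the `<see>` span against annotated visual knowledge.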