🤖 AI Summary
While current multimodal large language models (MLLMs) excel at object recognition, they lack human-like understanding of physical and social regularities grounded in visual perception.
Method: This work introduces the concept of “visual knowledge” and establishes VKnowU, the first video-based benchmark (1,680 questions) jointly oriented toward world-centric and human-centric knowledge, alongside the See-Think-Answer reasoning paradigm. The authors further propose a visual-knowledge-guided reward function and train VideoKnow+, leveraging the multimodal visual question-answering dataset VKnowQA.
Contribution/Results: Experiments show that VideoKnow+ achieves a +3.7% gain on VKnowU and delivers consistent improvements across general-purpose video-understanding benchmarks—including MVBench, Video-MME, and MMVU—significantly narrowing the gap between model and human performance in visual knowledge comprehension. This is the first systematic effort to both evaluate and enhance the deep visual cognitive capabilities of MLLMs.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions across 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) domains. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
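The abstract describes a See-Think-Answer output format trained with a visual knowledge reward, but does not give the reward's exact form. The sketch below illustrates one plausible rule-based reward of this kind: a format term for producing the structured output, plus an accuracy term for the final answer. The tag names (`<see>`, `<think>`, `<answer>`) and the 0.5/0.5 weighting are assumptions for illustration, not the authors' specification.

```python
import re

# Assumed response structure: <see>...</see><think>...</think><answer>...</answer>
# (hypothetical tags; the paper's actual schema may differ)
SEE_THINK_ANSWER = re.compile(
    r"<see>(.*?)</see>\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def visual_knowledge_reward(response: str, gold_answer: str) -> float:
    """Toy reward combining a format term (response follows the
    See-Think-Answer structure) with an accuracy term (the <answer>
    span matches the reference). Weights are illustrative assumptions."""
    match = SEE_THINK_ANSWER.search(response)
    if match is None:
        return 0.0  # malformed output earns no reward
    format_reward = 0.5
    answer = match.group(3).strip().lower()
    accuracy_reward = 0.5 if answer == gold_answer.strip().lower() else 0.0
    return format_reward + accuracy_reward
```

In an RL fine-tuning loop (e.g., GRPO- or PPO-style), a scalar reward of this shape would be computed per sampled response and used to update the policy; the actual VideoKnow+ reward may additionally score the `<see>` span against annotated visual knowledge.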