🤖 AI Summary
Problem: Vision-language models (VLMs) struggle to combine perception, understanding, and normative judgment in social contexts. Method: This paper proposes a cognition-inspired, three-stage multimodal reasoning framework: (1) a perception layer that extracts visual-semantic features; (2) a situational layer that models social relationships and contextual dynamics; and (3) a normative layer that integrates ethical and sociocultural principles for value-laden judgment. Unlike conventional chain-of-thought (CoT) prompting, which often breaks down on such tasks, the framework employs a hierarchical prompting strategy that explicitly decouples and coordinates reasoning across stages. Contribution/Results: Evaluated on multiple social-perception multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), the framework consistently outperforms standard CoT and direct prompting, by 8% on average. It also enhances the interpretability and sociocultural plausibility of model reasoning without compromising accuracy.
📝 Abstract
Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge, all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
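To make the contrast between flat CoT and the three-stage scaffold concrete, here is a minimal Python sketch of how such prompts might be assembled. The stage names (perception, situation, norm) come from the abstract; the instruction wording, the `Stage` class, and the functions `build_cocot_prompt` and `build_flat_cot_prompt` are hypothetical illustrations, not the paper's actual prompt text or code. No model is called; the sketch only builds the text that would accompany an image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    instruction: str


# ASSUMPTION: the instruction wording below is illustrative only.
# The abstract names the three stages but does not give prompt text.
COCOT_STAGES = (
    Stage("Perception", "Describe what is directly observable in the image."),
    Stage(
        "Situation",
        "Based on those observations, explain the relationships and "
        "context among the people or objects.",
    ),
    Stage(
        "Norm",
        "Given that situation, state which social norms or expectations "
        "apply and what they imply.",
    ),
)


def build_cocot_prompt(question: str, stages=COCOT_STAGES) -> str:
    """Build a staged prompt: each stage is answered in order before the final answer."""
    lines = [
        f"Question: {question}",
        "",
        "Reason through the following stages in order:",
    ]
    for i, stage in enumerate(stages, start=1):
        lines.append(f"{i}. {stage.name}: {stage.instruction}")
    lines.append("")
    lines.append("Final answer:")
    return "\n".join(lines)


def build_flat_cot_prompt(question: str) -> str:
    """Build a flat CoT prompt for comparison: one undifferentiated reasoning step."""
    return f"Question: {question}\n\nLet's think step by step.\n\nFinal answer:"


if __name__ == "__main__":
    print(build_cocot_prompt("Is the person's action appropriate here?"))
```

The design point is that the staged prompt fixes the order of reasoning (observe, then interpret, then judge), whereas the flat prompt leaves the structure of the reasoning to the model.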