🤖 AI Summary
This work investigates the cross-domain generalization of multimodal chain-of-thought (Multimodal-CoT) reasoning on non-scientific visual question answering (VQA) tasks that require commonsense and world knowledge: A-OKVQA, OKVQA, and ChartQA. We implement Zhang et al.’s two-stage framework, which separates rationale generation from answer inference, uses T5-based language models to generate explicit reasoning rationales, and integrates vision features through a gated fusion mechanism. Systematic ablation studies analyze the contributions of vision features, rationale quality, and architectural choices. Results show that integrating vision features significantly reduces hallucination in rationale generation; however, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. The study highlights the role of visual information in suppressing hallucination within Multimodal-CoT, clarifies current limitations in cross-domain commonsense reasoning, and offers practical insights for designing multimodal reasoning systems.
📝 Abstract
While recent work has extended chain-of-thought (CoT) reasoning to multimodal settings and achieved state-of-the-art results on science question answering benchmarks such as ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OKVQA, and ChartQA datasets, which require broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.
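To make the gated fusion mechanism mentioned above more concrete, the following is a minimal sketch in plain Python. The abstract does not state the fusion equations, so the form used here is an assumption: a sigmoid gate computed from linear projections of the language and vision vectors, followed by an element-wise convex combination of the two. The names `gated_fusion`, `W_l`, and `W_v` are hypothetical and chosen only for illustration; a real system would operate on batched encoder states from a T5 model and a vision backbone.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(h_lang, h_vis, W_l, W_v):
    """Fuse one language vector with one vision vector of the same size.

    Assumed form (not taken from the abstract):
        gate  = sigmoid(W_l @ h_lang + W_v @ h_vis)
        fused = (1 - gate) * h_lang + gate * h_vis
    """
    a = matvec(W_l, h_lang)
    b = matvec(W_v, h_vis)
    gate = [sigmoid(x + y) for x, y in zip(a, b)]
    fused = [(1.0 - g) * l + g * v for g, l, v in zip(gate, h_lang, h_vis)]
    return fused, gate


# Toy usage: with all-zero projection weights the gate is 0.5 everywhere,
# so the fused vector is the element-wise mean of the two inputs.
zeros = [[0.0, 0.0], [0.0, 0.0]]
fused, gate = gated_fusion([1.0, 3.0], [3.0, 5.0], zeros, zeros)
```

Under this assumed form, a gate near 0 leaves the language representation almost unchanged, while a gate near 1 replaces it with the vision feature. An ablation that removes vision features corresponds roughly to forcing the gate to 0.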