Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework

📅 2025-11-24

🤖 AI Summary

This work investigates the cross-domain generalization of multimodal chain-of-thought (Multimodal-CoT) reasoning on non-scientific visual question answering (VQA) tasks requiring commonsense and world knowledge, namely A-OKVQA, OKVQA, and ChartQA. We implement Zhang et al.’s two-stage framework, which uses T5-based language models to generate explicit reasoning rationales before answer inference and integrates vision features through a gated fusion mechanism. Systematic ablation studies analyze the contributions of vision features, rationale quality, and architectural choices. Results show that integrating vision features significantly reduces hallucination in rationale generation; however, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning remaining a particular challenge. The study clarifies current limitations in cross-domain commonsense reasoning and offers practical insights for implementing multimodal reasoning systems.

📝 Abstract

While recent work has extended chain-of-thought (CoT) reasoning to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OKVQA and ChartQA datasets, which require broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.
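The two-stage separation described in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function name `two_stage_answer`, the `Generator` callable type, and the prompt layout are hypothetical, and the real models would be T5-based networks rather than plain callables.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model maps (text prompt, vision features) to text.
# In the paper these would be T5-based models; here they are plain callables.
Generator = Callable[[str, Sequence[float]], str]


def two_stage_answer(
    question: str,
    context: str,
    vision_features: Sequence[float],
    rationale_model: Generator,
    answer_model: Generator,
) -> Tuple[str, str]:
    """Run rationale generation, then answer inference conditioned on it."""
    # Stage 1: generate a rationale from the question, context, and image.
    # The prompt layout is an illustrative assumption, not the paper's format.
    prompt = f"Question: {question}\nContext: {context}\nRationale:"
    rationale = rationale_model(prompt, vision_features)

    # Stage 2: append the generated rationale to the original input and
    # infer the final answer, again with access to the vision features.
    answer_prompt = f"{prompt} {rationale}\nAnswer:"
    answer = answer_model(answer_prompt, vision_features)
    return rationale, answer
```

Keeping the two stages as separate calls makes it straightforward to ablate either one, for example by passing empty vision features to stage 1 only, or by substituting a fixed rationale in stage 2.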

Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal reasoning generalization across diverse question-answering domains

Analyzes vision integration impact on reducing hallucination in rationale generation

Identifies challenges in commonsense reasoning within cross-domain CoT frameworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework separates rationale generation from answer inference

Gated fusion integrates vision features with T5-based language models

Systematic ablation studies analyze architectural choices
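The gated fusion idea listed above can be sketched in a few lines. The page only names the mechanism, so the specific form used here is an assumption: a sigmoid gate computed from linear projections of the language and vision vectors, followed by a per-dimension convex combination. The names `gated_fusion`, `w_lang`, and `w_vis` are hypothetical, and plain Python lists stand in for tensors.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector v (cols)."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def gated_fusion(
    h_lang: Vector, h_vis: Vector, w_lang: Matrix, w_vis: Matrix
) -> Tuple[Vector, Vector]:
    """Fuse a language vector with a vision vector of the same size.

    Assumed form (not confirmed by the source):
        gate  = sigmoid(w_lang @ h_lang + w_vis @ h_vis)
        fused = (1 - gate) * h_lang + gate * h_vis
    """
    proj_lang = matvec(w_lang, h_lang)
    proj_vis = matvec(w_vis, h_vis)
    gate = [sigmoid(a + b) for a, b in zip(proj_lang, proj_vis)]
    # Each dimension is a convex mix: gate near 0 keeps the language
    # signal, gate near 1 lets the vision signal dominate.
    fused = [(1.0 - g) * l + g * v for g, l, v in zip(gate, h_lang, h_vis)]
    return fused, gate
```

With all-zero projection weights the gate is 0.5 in every dimension, so the fused vector is the simple average of the two inputs. Inspecting the returned gate values is one way to probe how much a trained model relies on vision features for a given question type.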
