Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle with multi-choice question answering over complex-layout, long-text PDF documents—especially in Japanese—due to poor layout understanding, shallow semantic parsing, and severe English-language bias. This paper proposes a vision-language hierarchical reasoning framework that integrates a sub-question decomposition–driven semantic verification mechanism with a Colqwen-optimized cross-lingual retrieval module, enabling fine-grained layout awareness and deep semantic alignment. Key contributions are: (i) the first coupling of hierarchical reasoning with semantic verification for multimodal PDF QA; and (ii) enhanced cross-lingual representation for non-English documents via Colqwen, mitigating language bias in training data. Experiments on a Japanese ten-option PDF QA benchmark demonstrate significant improvements over state-of-the-art models, achieving a 12.3% absolute accuracy gain, alongside enhanced robustness and deployment adaptability.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.
Problem

Research questions and friction points this paper is trying to address.

Addresses multimodal question answering in complex PDF documents
Overcomes language bias for Japanese and non-English scenarios
Enhances deep semantic parsing of lengthy multimodal content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal hierarchical reasoning mechanism
Colqwen-optimized retrieval methods
Semantic verification via sub-question decomposition
👥 Authors
Ao Zhou — State Key Laboratory for Novel Software Technology, Nanjing University
Zebo Gu — Chongqing University of Posts and Telecommunications
Tenghao Sun — Chongqing University of Posts and Telecommunications
Jiawen Chen — Chongqing University of Posts and Telecommunications
Mingsheng Tu — Chongqing University of Posts and Telecommunications
Zifeng Cheng — State Key Laboratory for Novel Software Technology, Nanjing University
Yafeng Yin — State Key Laboratory for Novel Software Technology, Nanjing University
Zhiwei Jiang — Nanjing University
Qing Gu — Nanjing University