🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle with multiple-choice question answering over long PDF documents with complex layouts, especially in Japanese, because of weak layout understanding, shallow semantic parsing, and a strong bias toward English training data. This paper proposes a vision-language hierarchical reasoning framework that combines a semantic verification mechanism driven by sub-question decomposition with a Colqwen-optimized cross-lingual retrieval module, targeting fine-grained layout awareness and deeper semantic alignment. The stated contributions are: (i) coupling hierarchical reasoning with semantic verification for multimodal PDF QA, presented as the first such combination; and (ii) stronger cross-lingual representation of non-English documents via Colqwen, mitigating the language bias in training data. Experiments on a Japanese ten-option PDF QA benchmark report a 12.3% absolute accuracy gain over state-of-the-art models, along with improved robustness and deployment adaptability.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong multimodal understanding in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still show significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models are strongly biased toward English training data, which leads to suboptimal performance on Japanese and other non-English documents. To address these challenges, this paper proposes a Japanese PDF document understanding framework that combines multimodal hierarchical reasoning with Colqwen-optimized retrieval and introduces a semantic verification strategy based on sub-question decomposition. Experimental results show that the framework significantly improves the model's deep semantic parsing of complex documents and also exhibits strong robustness in practical application scenarios.
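The abstract does not spell out how the retrieval step scores pages. As a rough illustration only, the sketch below assumes the Colqwen retriever follows the ColPali-style late-interaction design, in which each query token embedding is matched against its best page-patch embedding and the maxima are summed (MaxSim). The function names, the toy two-dimensional embeddings, and the ranking helper are hypothetical and are not taken from the paper; a real system would obtain the embeddings from the model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token take the best-matching
    page patch, then sum those maxima. Assumes page_patches is non-empty."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: Sequence[Vector], pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical values, chosen only to make the arithmetic visible).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[0.9, 0.1], [0.2, 0.1]],  # page 0: matches only the first query token well
    [[0.8, 0.0], [0.0, 0.9]],  # page 1: matches both query tokens
]
ranking = rank_pages(query, pages)
```

In this toy case page 1 ranks first because it covers both query tokens, which is the behaviour that makes per-token matching attractive for long, layout-heavy documents where the relevant evidence occupies a small region of a page.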