🤖 AI Summary
Financial multimodal reasoning faces dual challenges: scarcity of high-quality training data and inefficiency of existing training paradigms. To address these, this paper proposes an automated multimodal reasoning enhancement framework tailored for financial report understanding. First, it introduces a novel *disentangled image-text alignment QA generation* paradigm, yielding a high-quality dataset of 89,378 financial report–image–question–answer quadruples. Second, it designs a two-stage adversarial reinforcement training mechanism that jointly optimizes format compliance, factual accuracy, image relevance, chain-of-thought (CoT) length, and an adversarial reward. Third, it incorporates contrastive multi-image sampling, finance-specific vision–language alignment, and CoT reward modeling. Evaluated on seven financial multimodal benchmarks, the framework achieves significant improvements in answer accuracy and reasoning depth over state-of-the-art methods, demonstrating both effectiveness and strong generalization.
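The disentangled generation-then-alignment idea can be sketched in a few lines: generate QA pairs from the report text alone, then match each question to the most relevant extracted image afterwards. This is a minimal illustrative sketch, not the paper's pipeline; the toy QA generator (a real system would call an LLM) and the word-overlap alignment scorer are assumptions made for the example.

```python
def generate_qa(report_text: str) -> list[tuple[str, str]]:
    """Toy stand-in: the real pipeline would prompt an LLM to generate
    QA pairs from the report text only, independent of any images."""
    # Hypothetical fixed output for illustration.
    return [("What was 2023 revenue?", "$1.2B")]

def score_alignment(question: str, caption: str) -> int:
    """Toy alignment score via word overlap; the paper would use a
    stronger matcher (e.g. a learned image-question relevance model)."""
    q_words = set(question.lower().replace("?", "").split())
    c_words = set(caption.lower().split())
    return len(q_words & c_words)

def build_quadruples(report_id: str, report_text: str,
                     images: list[dict]) -> list[dict]:
    """Pair each generated question with its best-aligned image,
    yielding report-image-question-answer quadruples."""
    quadruples = []
    for question, answer in generate_qa(report_text):
        best = max(images, key=lambda im: score_alignment(question, im["caption"]))
        quadruples.append({"report": report_id, "image": best["id"],
                           "question": question, "answer": answer})
    return quadruples
```

The key design point is the decoupling: because questions are generated from text alone, a noisy image extractor cannot corrupt the QA content, and alignment failures can be filtered independently.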
📝 Abstract
Large Multimodal Models (LMMs) demonstrate significant cross-modal reasoning capabilities. However, financial applications face challenges due to the lack of high-quality multimodal reasoning datasets and the inefficiency of existing training paradigms for reasoning enhancement. To address these issues, we propose FinLMM-R1, an integrated framework that combines an automated, scalable data-construction pipeline with enhanced training strategies to improve the multimodal reasoning of LMMs. The Automated and Scalable Pipeline (ASP) resolves textual-visual misalignment in financial reports through a decoupled paradigm of question-answer generation and image-question alignment, ensuring data integrity and extraction efficiency. Through ASP, we collect 89,378 aligned image-question pairs from 23,397 financial reports, covering tasks such as arithmetic reasoning, statistical reasoning, financial explanation, and financial knowledge. Moreover, we introduce Thinking with Adversarial Reward in LMM (TAR-LMM), which extends the prior two-stage training framework [1] with additional reward mechanisms. In the first stage, we focus on text-only tasks with format and accuracy rewards to guide the model toward well-structured thinking content. In the second stage, we construct multi-image contrastive samples with additional reward components, including image selection, thinking-content length, and an adversarial reward, to jointly optimize the LMM across visual perception, reasoning efficiency, and logical coherence. Extensive experiments on seven benchmarks show that the ASP-derived dataset and training framework significantly improve answer accuracy and reasoning depth over existing reasoning LMMs in both general and financial multimodal contexts.
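The staged reward design described above can be sketched as a composite reward function: stage 1 scores only format compliance and answer accuracy, while stage 2 adds image-selection, thinking-length, and adversarial components. The weights, the length-shaping rule, and the `[0, 1]` discriminator score are illustrative assumptions; the paper's actual reward coefficients are not reproduced here.

```python
def length_reward(n_cot_tokens: int, target: int = 512) -> float:
    """Shaped length reward: encourage longer thinking up to a target
    length, with no extra bonus beyond it (hypothetical shaping rule)."""
    return min(n_cot_tokens / target, 1.0)

def stage1_reward(format_ok: bool, answer_correct: bool) -> float:
    """Stage 1 (text-only tasks): format + accuracy rewards only.
    Weights are illustrative, not the paper's values."""
    return 0.5 * float(format_ok) + 1.0 * float(answer_correct)

def stage2_reward(format_ok: bool, answer_correct: bool,
                  picked_relevant_image: bool, n_cot_tokens: int,
                  adv_score: float) -> float:
    """Stage 2 (multi-image contrastive samples): stage-1 terms plus
    image selection, thinking-length, and adversarial components."""
    r = stage1_reward(format_ok, answer_correct)
    r += 0.5 * float(picked_relevant_image)   # chose the aligned image among distractors
    r += 0.2 * length_reward(n_cot_tokens)    # reasoning-efficiency shaping
    r += 0.3 * adv_score                      # discriminator score in [0, 1]
    return r
```

In an RL loop (e.g. a GRPO-style trainer), this scalar would be computed per rollout and used to weight policy-gradient updates; the two stages differ only in which reward terms are active.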