Unlocking Financial Insights: An Advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Financial consultation videos (30–40 minutes long) pose challenges due to their length, tight coupling of multimodal information, and difficulty in cross-modal alignment. To address these, we propose FASTER: a framework integrating BLIP for vision-language semantic modeling, OCR for on-screen text extraction, and speaker-aware Whisper for speech transcription—enabling modality-aligned visual-textual summarization. We further introduce a factuality-enhanced Direct Preference Optimization (DPO) strategy and keyframe retrieval with ranking to improve cross-modal consistency and interpretability. Evaluated on our curated Fin-APT dataset, FASTER significantly outperforms state-of-the-art large models in summary quality, critical information recall, and robustness. It demonstrates strong cross-domain generalization, offering a practical, deployable solution for multimodal financial content understanding.

📝 Abstract
The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30–40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address data scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER
Problem

Research questions and friction points this paper is trying to address.

Extracting insights from lengthy multimodal financial advisory videos
Aligning visual keyframes with relevant textual summary points
Addressing data scarcity for robust multimodal financial research
Innovation

Methods, ideas, or system contributions that make the work stand out.

BLIP, OCR, and Whisper for multimodal feature extraction
Modified DPO loss with fact-checking for precision
Ranker-based retrieval aligns keyframes with text
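The Innovation bullets mention a modified DPO loss with fact-checking. The paper's BOS-specific fact-checking term is not specified here, but the standard DPO preference objective it builds on can be sketched as follows (a minimal illustration; the function and parameter names are assumptions, not the authors' implementation):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss on one preference pair (chosen vs. rejected summary).

    Each argument is a sequence log-probability under the policy or the
    frozen reference model; beta scales the implicit reward margin.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Negative log-sigmoid of the reward gap: loss shrinks as the policy
    # prefers the chosen (human-aligned) summary over the rejected one.
    gap = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-gap)))
```

When both margins are equal the loss is log 2 (no preference learned); it decreases as the policy assigns relatively higher probability to the chosen summary. FASTER's variant reportedly adds a factuality signal on top of this objective, which is not reproduced here.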
Sarmistha Das
Indian Institute of Technology Patna, Patna, India
R. E. Zera Marveen Lyngkhoi
Indian Institute of Technology Patna, Patna, India
Sriparna Saha
Indian Institute of Technology Patna, Patna, India
Alka Maurya
CRISIL LTD, Mumbai, India