Unlocking Financial Insights: An Advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Financial consultation videos (30–40 minutes long) pose challenges due to their length, tight coupling of multimodal information, and difficulty in cross-modal alignment. To address these, we propose FASTER: a framework integrating BLIP for vision-language semantic modeling, OCR for on-screen text extraction, and speaker-aware Whisper for speech transcription—enabling modality-aligned visual-textual summarization. We further introduce a factuality-enhanced Direct Preference Optimization (DPO) strategy and keyframe retrieval with ranking to improve cross-modal consistency and interpretability. Evaluated on our curated Fin-APT dataset, FASTER significantly outperforms state-of-the-art large models in summary quality, critical information recall, and robustness. It demonstrates strong cross-domain generalization, offering a practical, deployable solution for multimodal financial content understanding.

📝 Abstract
The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30–40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address data scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER
Problem

Research questions and friction points this paper is trying to address.

Extracting insights from lengthy multimodal financial advisory videos
Aligning visual keyframes with relevant textual summary points
Addressing data scarcity for robust multimodal financial research
Innovation

Methods, ideas, or system contributions that make the work stand out.

BLIP, OCR, and Whisper for multimodal feature extraction
Modified DPO loss with fact-checking for precision
Ranker-based retrieval aligns keyframes with text
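The Innovation bullets mention a modified DPO loss with fact-checking. The paper's BOS-specific fact-checking term is not specified here, but the standard DPO preference objective it builds on can be sketched as follows (a minimal illustration; the function and parameter names are assumptions, not the authors' implementation):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss on one preference pair (chosen vs. rejected summary).

    Each argument is a sequence log-probability under the policy or the
    frozen reference model; beta scales the implicit reward margin.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Negative log-sigmoid of the reward gap: loss shrinks as the policy
    # prefers the chosen (human-aligned) summary over the rejected one.
    gap = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-gap)))
```

When both margins are equal the loss is log 2 (no preference learned); it decreases as the policy assigns relatively higher probability to the chosen summary. FASTER's variant reportedly adds a factuality signal on top of this objective, which is not reproduced here.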
Sarmistha Das
Indian Institute of Technology Patna, Patna, India
R. E. Zera Marveen Lyngkhoi
Indian Institute of Technology Patna, Patna, India
Sriparna Saha
Indian Institute of Technology Patna, Patna, India
Alka Maurya
CRISIL LTD, Mumbai, India