🤖 AI Summary
This work addresses the task of generating semantically coherent titles for financial short videos and introduces the first multimodal captioning benchmark designed for this domain. Methodologically, the authors systematically evaluate seven modality combinations of text (T), audio (A), and video (V), spanning the unimodal, pairwise (TA, TV, AV), and triple (TAV) fusions, on a carefully curated dataset of 624 samples annotated across five financial themes, including main recommendation and sentiment analysis. Key findings reveal that the unimodal video stream achieves superior performance on four of the five themes, and that certain bimodal configurations (e.g., TV, AV) outperform the trimodal TAV fusion, indicating that visual cues dominate the information contribution while fuller multimodal integration may introduce noise. The study thus uncovers distinctive challenges and characteristics of modality synergy in financial short videos. To foster reproducible research, the dataset, source code, and evaluation framework are publicly released, establishing a foundational resource for future multimodal financial content understanding.
📝 Abstract
We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of the five topics, underscoring its value for capturing visual context and affective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that adding more modalities can introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate both the potential and the challenges of grounding complex visual cues in this domain. All code and data are available on our GitHub under the CC BY-NC-SA 4.0 license.
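As a minimal illustration of the evaluation grid, the seven modality combinations are simply the non-empty subsets of {T, A, V}; a short sketch (not from the paper's released code) enumerates them:

```python
from itertools import combinations

# Sketch: enumerate the seven evaluated modality combinations
# (unimodal, pairwise, and triple fusions of T, A, V).
modalities = ["T", "A", "V"]
combos = ["".join(c) for r in range(1, 4) for c in combinations(modalities, r)]
print(combos)  # ['T', 'A', 'V', 'TA', 'TV', 'AV', 'TAV']
```

Each label here corresponds to one experimental condition in which only the listed input streams are supplied to the model.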