🤖 AI Summary
This work addresses the task of generating semantically coherent titles for financial short videos and introduces the first multimodal captioning benchmark designed for this domain. Methodologically, the authors systematically evaluate seven modality combinations of text (T), audio (A), and video (V), spanning the unimodal, pairwise (TA, TV, AV), and triple (TAV) fusions, on a carefully curated dataset of 624 samples annotated across five financial themes, including main recommendation and sentiment analysis. Key findings reveal that the unimodal video stream achieves superior performance on four of the five themes, and that certain bimodal configurations (e.g., TV, AV) outperform the trimodal TAV fusion, indicating that visual cues dominate the information contribution while fuller multimodal integration may introduce noise. The study thus uncovers distinctive challenges and characteristics of modality synergy in financial short videos. To foster reproducible research, the dataset, source code, and evaluation framework are publicly released, establishing a foundational resource for future multimodal financial content understanding.
📝 Abstract
We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of the five topics, underscoring its value for capturing visual context and affective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that adding more modalities can introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate both the potential and the challenges of grounding complex visual cues in this domain. All code and data are available on our GitHub under the CC BY-NC-SA 4.0 license.
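As a minimal illustration of the evaluation grid, the seven modality combinations are simply the non-empty subsets of {T, A, V}; a short sketch (not from the paper's released code) enumerates them:

```python
from itertools import combinations

# Sketch: enumerate the seven evaluated modality combinations
# (unimodal, pairwise, and triple fusions of T, A, V).
modalities = ["T", "A", "V"]
combos = ["".join(c) for r in range(1, 4) for c in combinations(modalities, r)]
print(combos)  # ['T', 'A', 'V', 'TA', 'TV', 'AV', 'TAV']
```

Each label here corresponds to one experimental condition in which only the listed input streams are supplied to the model.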