QCaption: Video Captioning and Q&A through Fusion of Large Multimodal Models

📅 2024-07-08
🏛️ Fusion
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work proposes a fully self-contained multimodal fusion pipeline for video captioning and question answering, targeting two limitations of existing approaches: insufficient semantic understanding and reliance on external services. By combining key frame extraction, a Large Multimodal Model (LMM), and a Large Language Model (LLM), the framework performs video understanding without depending on external APIs and can therefore be deployed fully locally. Experiments show improvements of up to 44.2% in video captioning and up to 48.9% in video question answering.

📝 Abstract
This paper introduces QCaption, a novel video captioning and Q&A pipeline that enhances video analytics by fusing three models: key frame extraction, a Large Multimodal Model (LMM) for image-text analysis, and a Large Language Model (LLM) for text analysis. This approach enables integrated analysis of text, images, and video, and improves on existing video captioning and Q&A models while remaining fully self-contained and suited to on-premises deployment. Experimental results using QCaption demonstrated improvements of up to 44.2% and 48.9% in video captioning and Q&A tasks, respectively. Ablation studies were also performed to assess the contribution of the LLM to the fusion results. Moreover, the paper proposes and evaluates additional video captioning approaches, benchmarking them against QCaption and existing methodologies. QCaption demonstrates the potential of a model fusion approach for advancing video analytics.
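The three-stage fusion described in the abstract can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the frame-difference rule used for key frame selection, the `threshold` parameter, and the `describe_frame` and `fuse_text` callables (stand-ins for the LMM and the LLM) are all hypothetical names introduced for illustration. The abstract does not specify how key frames are chosen or how the models are prompted.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for a decoded video frame: a flat list of pixel values.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def extract_key_frames(frames: List[Frame], threshold: float) -> List[int]:
    """Stage 1 (assumed rule): keep frame 0, then every frame that differs
    from the most recently kept frame by more than `threshold`."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


def run_pipeline(
    frames: List[Frame],
    describe_frame: Callable[[Frame], str],      # stand-in for the LMM
    fuse_text: Callable[[List[str], str], str],  # stand-in for the LLM
    question: Optional[str] = None,
    threshold: float = 0.5,
) -> str:
    """Key frames -> per-frame descriptions (stage 2) -> fused answer (stage 3)."""
    key_ids = extract_key_frames(frames, threshold)
    captions = [describe_frame(frames[i]) for i in key_ids]
    task = question if question is not None else "Summarize the video."
    return fuse_text(captions, task)


# Toy usage with stub models; a real system would call local LMM/LLM inference here.
frames = [[0, 0], [0, 0.1], [1, 1], [1, 1], [0, 0]]
stub_lmm = lambda f: f"brightness={sum(f) / len(f):.1f}"
stub_llm = lambda caps, task: task + " | " + "; ".join(caps)
result = run_pipeline(frames, stub_lmm, stub_llm)
```

Passing the two models in as plain callables keeps the sketch self-contained and reflects the on-premises framing: any locally hosted model can be substituted without changing the pipeline code. When `question` is provided, the same flow serves the Q&A task instead of captioning.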
Problem

Research questions and friction points this paper is trying to address.

video captioning
video question answering
multimodal fusion
large language models
video analytics
Innovation

Methods, ideas, or system contributions that make the work stand out.

model fusion
large multimodal model
video captioning
video question answering
on-premises deployment
Jiale Wang
HKUST, BUPT
Medical robots
Gee Wah Ng
Q Team, Home Team Science and Technology Agency, Singapore
L. Mak
Q Team, Home Team Science and Technology Agency, Singapore
Randall Cher
Department of Computer Science and Engineering, Nanyang Technological University, Singapore
Ng Ding Hei Ryan
Department of Computer Science and Engineering, Nanyang Technological University, Singapore
Davis Wang
Department of Computer Science and Engineering, Nanyang Technological University, Singapore