An Empirical Study for Representations of Videos in Video Question Answering via MLLMs

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the accuracy–efficiency trade-offs of different video representation strategies for multimodal large language models (MLLMs) in video question answering (VideoQA). We conduct unified experiments on VideoMME and LongVideoBench, comparing unimodal inputs, namely visual frames, subtitles generated by automatic speech recognition (ASR), and audio features, as well as their multimodal combinations. Key-frame sampling, ASR, and audio feature extraction are used to produce these representations, which are then fed to the MLLM for joint modeling and inference. Results show that visual frames yield the highest accuracy but incur substantial computational overhead; subtitles, in contrast, serve as a lightweight semantic representation that strikes the best balance between performance and efficiency, especially for long videos. To our knowledge, this is the first work to quantitatively characterize how modality selection governs the resource–accuracy trade-off in VideoQA systems. Our findings provide reproducible empirical evidence and practical design guidelines for deploying efficient, real-world VideoQA architectures.
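As a rough illustration of the pipeline described above, the sketch below pairs uniform key-frame sampling (via OpenCV) with ASR-based subtitle generation (via openai-whisper). The frame budget, ASR backend, and function names are assumptions chosen for illustration, not the paper's actual implementation.

```python
import cv2
import numpy as np

def sample_key_frames(video_path: str, num_frames: int = 8) -> list:
    """Uniformly sample num_frames frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # Convert BGR (OpenCV default) to RGB for the vision encoder.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def transcribe_subtitles(video_path: str) -> str:
    """Generate subtitle text with ASR; whisper extracts audio via ffmpeg."""
    import whisper  # openai-whisper; any ASR backend would work here
    model = whisper.load_model("base")
    return model.transcribe(video_path)["text"]
```

Either output (sampled frames, subtitle text, or both) is then handed to the MLLM together with the question.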

📝 Abstract
Multimodal large language models (MLLMs) have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single-modality inputs (question only, subtitles, visual frames, and audio signals) as well as multimodal combinations on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially enhance accuracy but impose heavy costs in GPU memory and inference latency, while subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and provide practical insights for designing resource-aware MLLM-based VideoQA systems.
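The ablation protocol the abstract describes can be pictured as a loop over modality subsets that records accuracy alongside latency and peak GPU memory. Below is a minimal sketch under that reading; `mllm_answer` and the benchmark loader are hypothetical placeholders for whichever MLLM and VideoMME/LongVideoBench harness is used.

```python
import time
from itertools import combinations
import torch

EXTRA_MODALITIES = ["subtitles", "frames", "audio"]  # the question is always given

def evaluate(mllm_answer, benchmark, extras):
    """Run one modality configuration and report accuracy, latency, and memory."""
    correct, latencies = 0, []
    torch.cuda.reset_peak_memory_stats()
    for sample in benchmark:
        inputs = {m: sample[m] for m in extras}
        start = time.perf_counter()
        pred = mllm_answer(sample["question"], **inputs)
        latencies.append(time.perf_counter() - start)
        correct += int(pred == sample["answer"])
    return {
        "modalities": ["question", *extras],
        "accuracy": correct / len(benchmark),
        "avg_latency_s": sum(latencies) / len(latencies),
        "peak_gpu_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }

# Sweep question-only plus every unimodal and multimodal combination:
# for r in range(len(EXTRA_MODALITIES) + 1):
#     for extras in combinations(EXTRA_MODALITIES, r):
#         print(evaluate(my_mllm, benchmark_samples, list(extras)))
```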
Problem

Research questions and friction points this paper is trying to address.

Evaluating effective video representations for MLLMs in VideoQA
Balancing accuracy and computational efficiency across modalities
Assessing trade-offs between visual frames and subtitles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates single modality inputs for video question answering
Compares visual frames and subtitles for efficiency trade-offs (see the token-budget sketch after this list)
Provides insights for resource-aware multimodal system design
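To make the frames-versus-subtitles trade-off concrete, a back-of-envelope token budget is sketched below; every number is an illustrative assumption, not a figure reported by the paper.

```python
# Hypothetical token budget for one long video.
TOKENS_PER_FRAME = 576      # e.g., a 24x24 visual-token grid per frame
NUM_FRAMES = 64             # frames sampled from the video
SUBTITLE_TOKENS = 1_500     # rough transcript length for ~10 min of speech

frame_tokens = NUM_FRAMES * TOKENS_PER_FRAME  # 36,864 visual tokens
print(frame_tokens / SUBTITLE_TOKENS)         # ~24.6x more input tokens
```

Under these assumptions, feeding frames costs roughly 25x more input tokens than feeding subtitles, which is consistent with the paper's finding that subtitles are the lightweight option.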
👥 Authors
Zhi Li
KDDI Research Inc., Fujimino-shi, Japan
Yanan Wang
KDDI Research Inc., Fujimino-shi, Japan
Hao Niu
KDDI Research Inc., Fujimino-shi, Japan
Julio Vizcarra
KDDI Research Inc., Fujimino-shi, Japan
Masato Taya
KDDI Research Inc., Fujimino-shi, Japan