🤖 AI Summary
This study systematically evaluates the accuracy–efficiency trade-offs of different video representation strategies for multimodal large language models (MLLMs) in video question answering (VideoQA). We conduct unified experiments on VideoMME and LongVideoBench, comparing unimodal inputs—visual frames, subtitles generated by automatic speech recognition (ASR), and audio features—as well as their multimodal combinations. Key-frame sampling, ASR, and audio feature extraction are used to produce these representations, which are then fed to MLLMs for joint modeling and inference. Results show that visual frames yield the highest accuracy but incur substantial computational overhead; in contrast, subtitles serve as a lightweight semantic representation that achieves the best balance between performance and efficiency, especially for long videos. To our knowledge, this is the first work to quantitatively characterize how modality selection governs the resource–accuracy trade-off in VideoQA systems. Our findings provide reproducible empirical evidence and practical design guidelines for deploying efficient, real-world VideoQA architectures.
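The summary above mentions key-frame sampling as the first step of the visual pipeline. The paper's exact sampling strategy is not specified here; a minimal sketch of one common approach, uniform segment-midpoint sampling, is shown below (the function name and signature are illustrative, not from the paper):

```python
def sample_keyframe_indices(num_frames: int, num_samples: int) -> list[int]:
    """Uniformly sample frame indices from a video with `num_frames` frames.

    Illustrative sketch: splits the video into `num_samples` equal segments
    and takes the midpoint frame of each, a common baseline for key-frame
    selection in VideoQA pipelines. Not the paper's exact method.
    """
    if num_samples >= num_frames:
        # Fewer frames than requested samples: return every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

The selected indices would then be decoded into images and passed to the MLLM's vision encoder; increasing `num_samples` is precisely the lever that trades GPU memory and latency for accuracy in the experiments described above.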
📝 Abstract
Multimodal large language models (MLLMs) have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single-modality inputs (question only, subtitles, visual frames, and audio signals) as well as multimodal combinations on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially enhance accuracy but impose heavy costs in GPU memory and inference latency, while subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and provide practical insights for designing resource-aware MLLM-based VideoQA systems.