🤖 AI Summary
For zero-shot video understanding, this paper proposes a cross-modal framework that eliminates the need for pre-trained video models. The method employs an off-the-shelf ResNet, never pretrained on video, as the visual encoder, interfaces it directly with a large language model (LLM), and performs end-to-end joint optimization to align visual and linguistic representations. Its core contribution is a departure from conventional video-specific pretraining paradigms: it pioneers the use of a frozen ResNet backbone, not fine-tuned on video data, combined with zero-shot prompt learning and cross-modal feature mapping, achieving superior generalization while preserving architectural simplicity. Experiments demonstrate state-of-the-art zero-shot performance on four standard benchmarks: MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.
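The cross-modal feature mapping described above can be sketched as a learned projection from per-frame ResNet features into the LLM's token-embedding space, with the projected visual tokens prepended to the question's text embeddings. This is a minimal illustrative sketch, not the paper's implementation: the dimensions (2048-d ResNet pool features, 4096-d LLM embeddings), the function names, and the use of random placeholder features are all assumptions.

```python
import numpy as np

RESNET_DIM = 2048   # assumed: ResNet global-pool feature size
LLM_DIM = 4096      # assumed: LLM token-embedding size

rng = np.random.default_rng(0)
# A learned linear projection aligning visual features with the LLM
# embedding space (stand-in for the paper's cross-modal feature mapping).
W = rng.standard_normal((RESNET_DIM, LLM_DIM)) * 0.01

def project_frames(frame_feats: np.ndarray) -> np.ndarray:
    """Map per-frame ResNet features (T, 2048) into LLM space (T, 4096)."""
    return frame_feats @ W

def build_llm_input(frame_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected visual tokens to the text-token embeddings."""
    return np.concatenate([project_frames(frame_feats), text_embeds], axis=0)

# Toy example: 8 video frames and a 5-token question.
frames = rng.standard_normal((8, RESNET_DIM))
question = rng.standard_normal((5, LLM_DIM))
seq = build_llm_input(frames, question)
print(seq.shape)  # (13, 4096)
```

In a real system the frame features would come from a frozen ResNet forward pass and only the projection (and prompt parameters) would be trained, which is consistent with the frozen-backbone design the summary describes.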
📝 Abstract
In this paper, we introduce ResNetVLLM (ResNet Vision LLM), a novel cross-modal framework for zero-shot video understanding that integrates a ResNet-based visual encoder with a Large Language Model (LLM). ResNetVLLM addresses the challenges associated with zero-shot video models by avoiding reliance on pre-trained video understanding models, instead employing a non-pretrained ResNet to extract visual features. This design ensures that the model learns visual and semantic representations within a unified architecture, enhancing its ability to generate accurate and contextually relevant textual descriptions from video inputs. Our experimental results demonstrate that ResNetVLLM achieves state-of-the-art performance in zero-shot video understanding (ZSVU) on several benchmarks, including MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.