🤖 AI Summary
For zero-shot video understanding, this paper proposes a cross-modal framework that eliminates the need for pre-trained video models. The method employs an off-the-shelf ResNet, never pretrained on video, as the visual encoder, interfaces it directly with a large language model (LLM), and performs end-to-end joint optimization to align visual and linguistic representations. Its core contribution is a departure from conventional video-specific pretraining paradigms: it pioneers the use of a frozen ResNet backbone, not fine-tuned on video data, combined with zero-shot prompt learning and cross-modal feature mapping, achieving superior generalization while preserving architectural simplicity. Experiments demonstrate state-of-the-art zero-shot performance on four standard benchmarks: MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.
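The cross-modal feature mapping described above can be sketched as a learned projection from per-frame ResNet features into the LLM's token-embedding space, with the projected visual tokens prepended to the question's text embeddings. This is a minimal illustrative sketch, not the paper's implementation: the dimensions (2048-d ResNet pool features, 4096-d LLM embeddings), the function names, and the use of random placeholder features are all assumptions.

```python
import numpy as np

RESNET_DIM = 2048   # assumed: ResNet global-pool feature size
LLM_DIM = 4096      # assumed: LLM token-embedding size

rng = np.random.default_rng(0)
# A learned linear projection aligning visual features with the LLM
# embedding space (stand-in for the paper's cross-modal feature mapping).
W = rng.standard_normal((RESNET_DIM, LLM_DIM)) * 0.01

def project_frames(frame_feats: np.ndarray) -> np.ndarray:
    """Map per-frame ResNet features (T, 2048) into LLM space (T, 4096)."""
    return frame_feats @ W

def build_llm_input(frame_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected visual tokens to the text-token embeddings."""
    return np.concatenate([project_frames(frame_feats), text_embeds], axis=0)

# Toy example: 8 video frames and a 5-token question.
frames = rng.standard_normal((8, RESNET_DIM))
question = rng.standard_normal((5, LLM_DIM))
seq = build_llm_input(frames, question)
print(seq.shape)  # (13, 4096)
```

In a real system the frame features would come from a frozen ResNet forward pass and only the projection (and prompt parameters) would be trained, which is consistent with the frozen-backbone design the summary describes.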
📝 Abstract
In this paper, we introduce ResNetVLLM (ResNet Vision LLM), a novel cross-modal framework for zero-shot video understanding that integrates a ResNet-based visual encoder with a Large Language Model (LLM). ResNetVLLM addresses the challenges associated with zero-shot video models by avoiding reliance on pre-trained video understanding models, instead employing a non-pretrained ResNet to extract visual features. This design ensures that the model learns visual and semantic representations within a unified architecture, enhancing its ability to generate accurate and contextually relevant textual descriptions from video inputs. Our experimental results demonstrate that ResNetVLLM achieves state-of-the-art performance in zero-shot video understanding (ZSVU) on several benchmarks, including MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.