ViLL-E: Video LLM Embeddings for Retrieval

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video large language models (VideoLLMs) underperform specialized embedding models on retrieval tasks such as text-to-video retrieval and moment retrieval. This work proposes ViLL-E, a unified VideoLLM architecture with a novel embedding generation mechanism that lets the model spend more computation on complex videos and stop early on easy ones. Training combines generative and contrastive objectives in three stages: large-scale pre-training on video-caption pairs, continual training on a smaller detailed-caption dataset, and fine-tuning on a multi-task dataset covering video QA, temporal localization, video retrieval, and video-text matching. The model improves temporal localization by 7% on average over other VideoLLMs and video retrieval by up to 4% over dual-encoder models, reaching performance comparable to state-of-the-art specialized embedding models while remaining competitive on video QA. The joint training also yields zero-shot gains over prior state of the art: +5% on composed video retrieval and +2% on retrieval from long text.

📝 Abstract

Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on average 7% over other VideoLLMs) and video retrieval (up to 4% over dual-encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
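
To make the idea of joint contrastive-generative training concrete, the sketch below adds a next-token cross-entropy term to a symmetric InfoNCE term over matched video and text embeddings. It is an illustration only: the abstract does not give the actual loss, so the InfoNCE form, the temperature of 0.07, the `weight` parameter, and every function name here are assumptions rather than the authors' method.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: row i of each list is assumed to be a matched pair."""
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarity of every video with every text, scaled by temperature.
    sims = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
            for vi in v]
    video_to_text = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    text_to_video = -sum(
        log_softmax([sims[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


def generative_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over one target sequence."""
    total = sum(log_softmax(logits)[y]
                for logits, y in zip(token_logits, target_ids))
    return -total / len(target_ids)


def joint_loss(video_embs, text_embs, token_logits, target_ids, weight=1.0):
    """Hypothetical combination: generative term plus weighted contrastive term."""
    return (generative_loss(token_logits, target_ids)
            + weight * contrastive_loss(video_embs, text_embs, temperature=0.07))
```

In this toy form, a batch whose video and text embeddings line up pair by pair gets a contrastive term near zero, while a batch with swapped pairs is penalized heavily; the generative term is unaffected by the embeddings and depends only on the predicted token logits.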

Problem

Research questions and friction points this paper is trying to address.

Video Large Language Models

Video Retrieval

Embedding-based Models

Moment Retrieval

Text-to-Video Retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Large Language Model

Embedding Generation

Contrastive-Generative Training