ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video similarity models rely on a single global similarity score, which limits fine-grained, multi-perspective, and conceptually grounded comparisons of the kind humans make naturally. To address this, the paper introduces Concept-based Video Similarity estimation (ConViS), a task that compares pairs of videos by computing interpretable similarity scores across a predefined set of semantic concepts, such as actions and locations. To support the task, the authors release ConViS-Bench, a multi-domain, human-annotated benchmark in which each video pair carries concept-level similarity scores together with textual descriptions of both similarities and differences. Benchmarking several state-of-the-art Large Multimodal Models, which leverage natural language for cross-video comparison, reveals significant performance disparities across concepts: some semantic dimensions are markedly harder to compare than others, underscoring the need for concept-aware video understanding.
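To make the task formulation concrete, here is a minimal sketch of a per-concept scoring loop. The concept list and the `query_lmm` function are illustrative assumptions, not the paper's actual interface; any video-capable LMM could be plugged in behind the placeholder.

```python
# Minimal sketch of the ConViS task interface (illustrative only).
# `query_lmm` is a hypothetical stand-in for any video-capable
# Large Multimodal Model; it is NOT the paper's API.

CONCEPTS = ["action", "location", "objects", "people"]  # assumed concept set

def query_lmm(prompt: str, videos: tuple[str, str]) -> float:
    """Placeholder: ask an LMM to rate similarity in [0, 1]."""
    raise NotImplementedError("plug in a real video LMM here")

def convis_scores(video_a: str, video_b: str) -> dict[str, float]:
    """Return one interpretable similarity score per semantic concept."""
    scores = {}
    for concept in CONCEPTS:
        prompt = (
            f"On a scale from 0 to 1, how similar are these two videos "
            f"with respect to their {concept}? Answer with a number only."
        )
        scores[concept] = query_lmm(prompt, (video_a, video_b))
    return scores
```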

📝 Abstract
What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.
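The abstract names concept-conditioned video retrieval as one application the task enables. The sketch below shows one straightforward realization, assuming a per-concept scorer such as the `convis_scores` sketch above; the function and parameter names are hypothetical.

```python
from typing import Callable

# Concept-conditioned retrieval: rank a gallery by similarity to the
# query video along ONE chosen concept, ignoring all others.
# `score_fn` is any per-concept scorer, e.g. the convis_scores sketch.

def retrieve_by_concept(
    query: str,
    gallery: list[str],
    concept: str,
    score_fn: Callable[[str, str], dict[str, float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    ranked = sorted(
        ((video, score_fn(query, video)[concept]) for video in gallery),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]

# Usage (hypothetical paths): retrieve videos filmed in similar locations,
# regardless of whether the depicted actions match.
# hits = retrieve_by_concept("query.mp4", ["a.mp4", "b.mp4"], "location", convis_scores)
```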
Problem

Research questions and friction points this paper is trying to address.

Estimating video similarity through semantic concepts rather than global scores
Enabling human-like reasoning about video differences and similarities
Benchmarking models' alignment with human judgments on concept-level comparisons (see the correlation sketch after this list)
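Alignment with human judgments is commonly measured with rank correlation. The page does not state the paper's exact metric, so the sketch below uses per-concept Spearman correlation as one plausible choice; the dictionary layout is an assumption.

```python
from scipy.stats import spearmanr

def per_concept_alignment(
    human: dict[str, list[float]],  # concept -> human scores over video pairs
    model: dict[str, list[float]],  # concept -> model scores, same pair order
) -> dict[str, float]:
    """Spearman rank correlation between model and human similarity
    judgments, computed separately for each concept. This exposes which
    semantic dimensions a model handles well and which it struggles with."""
    return {c: spearmanr(human[c], model[c]).correlation for c in human}
```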
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging Large Multimodal Models for video understanding
Introducing concept-level similarity scores for interpretable comparison
Creating a multi-domain benchmark of video pairs annotated with concept-level similarity scores and textual descriptions of similarities and differences (a possible record layout is sketched after this list)
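Based on what the abstract describes, one ConViS-Bench entry pairs two videos with per-concept scores and free-text annotations. The dataclass below is a hypothetical record layout for working with such data; the field names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConVisBenchPair:
    """One annotated video pair (hypothetical schema)."""
    video_a: str                          # path or ID of the first video
    video_b: str                          # path or ID of the second video
    domain: str                           # e.g. "cooking", "sports"
    concept_scores: dict[str, float] = field(default_factory=dict)
    similarities: str = ""                # free text: what the videos share
    differences: str = ""                 # free text: how they differ
```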