🤖 AI Summary
This work investigates the knowledge transfer capability of video foundation models (ViFMs) across interdisciplinary scientific tasks—specifically, whether representations pretrained on generic data can effectively adapt to heterogeneous domains including medical imaging, animal behavior analysis, and weather forecasting, and whether they can compete with domain-specific models. To this end, we introduce SciVid, the first benchmark for multi-domain scientific video understanding, comprising five diverse tasks. Within a unified evaluation framework, we systematically assess the transfer performance of six state-of-the-art ViFMs augmented with trainable readout modules. Results show that several general-purpose ViFMs match or surpass domain-specific baselines across multiple tasks, validating their cross-domain generalization potential; however, the evaluations also expose critical bottlenecks in fine-grained spatiotemporal modeling and domain-specific semantic alignment. This work establishes a foundational benchmark, provides empirical evidence, and offers actionable insights toward universal modeling in scientific AI.
📝 Abstract
In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose, domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, or whether a single pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.
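To make the evaluation protocol concrete, the "frozen backbone + trainable readout" setup described above can be sketched as follows. This is a minimal NumPy illustration, not the SciVid implementation: the backbone here is a stand-in random projection for a pretrained ViFM feature extractor, the toy regression target and all dimensions are invented for the example, and only the linear readout's weights are updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "backbone": a fixed random projection standing in
# for a pretrained ViFM that maps (flattened) video clips to features.
IN_DIM, FEAT_DIM = 32, 16
W_backbone = rng.normal(size=(IN_DIM, FEAT_DIM))  # frozen, never updated

def backbone(x):
    # x: (batch, IN_DIM) flattened clips -> (batch, FEAT_DIM) features.
    return np.tanh(x @ W_backbone)

# Toy downstream task: targets are a linear function of the frozen features.
X = rng.normal(size=(256, IN_DIM))
w_true = rng.normal(size=FEAT_DIM)
y = backbone(X) @ w_true

# Trainable linear readout, fit with plain gradient descent.
# Only `w` is optimized; the backbone weights stay fixed throughout.
w = np.zeros(FEAT_DIM)
lr = 0.1
for _ in range(500):
    feats = backbone(X)          # frozen features (no gradient to backbone)
    err = feats @ w - y
    w -= lr * (feats.T @ err) / len(X)

mse = float(np.mean((backbone(X) @ w - y) ** 2))
print(f"readout MSE: {mse:.6f}")
```

Because gradients only flow into the readout, this setup isolates how much task-relevant information the pretrained representation already carries, which is exactly what a cross-domain transfer benchmark needs to measure.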