🤖 AI Summary
This work investigates the knowledge transfer capability of video foundation models (ViFMs) across interdisciplinary scientific tasks—specifically, whether representations pretrained on generic data can effectively adapt to heterogeneous domains including medical imaging, animal behavior analysis, and weather forecasting, and whether they can compete with domain-specific models. To this end, we introduce SciVid, the first benchmark for multi-domain scientific video understanding, comprising five diverse tasks. Within a unified evaluation framework, we systematically assess the transfer performance of six state-of-the-art ViFMs augmented with trainable readout modules. Results show that several general-purpose ViFMs match or surpass domain-specific baselines across multiple tasks, validating their cross-domain generalization potential; however, the evaluations also expose critical bottlenecks in fine-grained spatiotemporal modeling and domain-specific semantic alignment. This work establishes a foundational benchmark, provides empirical evidence, and offers actionable insights toward universal modeling in scientific AI.
📝 Abstract
In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose, domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, or whether a single pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.
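To make the evaluation protocol concrete, the "frozen backbone + trainable readout" setup described above can be sketched as follows. This is a minimal NumPy illustration, not the SciVid implementation: the backbone here is a stand-in random projection for a pretrained ViFM feature extractor, the toy regression target and all dimensions are invented for the example, and only the linear readout's weights are updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "backbone": a fixed random projection standing in
# for a pretrained ViFM that maps (flattened) video clips to features.
IN_DIM, FEAT_DIM = 32, 16
W_backbone = rng.normal(size=(IN_DIM, FEAT_DIM))  # frozen, never updated

def backbone(x):
    # x: (batch, IN_DIM) flattened clips -> (batch, FEAT_DIM) features.
    return np.tanh(x @ W_backbone)

# Toy downstream task: targets are a linear function of the frozen features.
X = rng.normal(size=(256, IN_DIM))
w_true = rng.normal(size=FEAT_DIM)
y = backbone(X) @ w_true

# Trainable linear readout, fit with plain gradient descent.
# Only `w` is optimized; the backbone weights stay fixed throughout.
w = np.zeros(FEAT_DIM)
lr = 0.1
for _ in range(500):
    feats = backbone(X)          # frozen features (no gradient to backbone)
    err = feats @ w - y
    w -= lr * (feats.T @ err) / len(X)

mse = float(np.mean((backbone(X) @ w - y) ** 2))
print(f"readout MSE: {mse:.6f}")
```

Because gradients only flow into the readout, this setup isolates how much task-relevant information the pretrained representation already carries, which is exactly what a cross-domain transfer benchmark needs to measure.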