🤖 AI Summary
Embryo assessment in assisted reproductive technologies has long relied on manual annotation, and existing AI approaches do not produce automated, standardized natural language descriptions from multimodal data. To bridge this gap, the authors introduce InVitroVision, a model that adapts the foundational vision-language model PaliGemma-2 to the in vitro fertilization (IVF) domain. Fine-tuned on only 1,000 time-lapse embryo images paired with textual captions, the model generates natural language descriptions of embryo morphology, embryonic cell cycle, and developmental stage. In the reported experiments, InVitroVision outperforms both the commercial model ChatGPT 5.2 and the base models in overall metrics, and its performance improves further as the training set grows, which points to the potential of few-shot transfer learning in clinical embryology.
📝 Abstract
The application of artificial intelligence (AI) in IVF has shown promise in improving the consistency and standardization of decisions, but it often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multimodal vision-language model, on only 1,000 images with corresponding captions describing embryo morphology, embryonic cell cycle, and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and the base models in overall metrics, with performance improving as the training dataset grew. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and it has implications for few-shot adaptation to multiple downstream tasks in IVF.