InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language

📅 2026-04-22

🤖 AI Summary
Embryo assessment in assisted reproductive technologies still relies largely on manual annotation, and existing AI approaches do not produce automated, standardized natural language descriptions of embryos. To bridge this gap, the authors introduce InVitroVision, a model that adapts the foundational vision–language model PaliGemma-2 to the in vitro fertilization (IVF) domain. Fine-tuned on only 1,000 time-lapse embryo images paired with captions from a publicly available dataset, the model generates natural language descriptions of embryo morphology, embryonic cell cycle, and developmental stage. InVitroVision outperforms both the commercial model ChatGPT 5.2 and the base models in overall metrics, and its performance improves further as the training set grows, pointing to the potential of foundational vision–language models for few-shot adaptation in clinical embryology.

📝 Abstract
The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.
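The data-preparation step implied by the abstract (selecting a fixed number of image and caption pairs for fine-tuning) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the `build_few_shot_set` helper, the prompt string, and the file names are all hypothetical, and the actual fine-tuning of PaliGemma-2 is not shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionExample:
    image_path: str  # path to one time-lapse frame (hypothetical layout)
    prompt: str      # text prefix passed to the model (assumed wording)
    target: str      # caption the model should learn to generate


def build_few_shot_set(pairs, n_train=1000, prompt="describe the embryo", seed=0):
    """Reproducibly sample up to n_train (image_path, caption) pairs."""
    if n_train < 0:
        raise ValueError("n_train must be non-negative")
    pairs = list(pairs)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    chosen = rng.sample(pairs, k=min(n_train, len(pairs)))
    # Strip stray whitespace so captions are compared consistently later.
    return [CaptionExample(path, prompt, caption.strip()) for path, caption in chosen]


if __name__ == "__main__":
    # Synthetic stand-in data; a real run would read the dataset's annotations.
    demo_pairs = [(f"frame_{i:04d}.jpg", f"caption {i}") for i in range(1500)]
    subset = build_few_shot_set(demo_pairs)
    print(len(subset), subset[0].image_path)
```

Varying `n_train` while holding the seed fixed is one simple way to set up the kind of training-set-size comparison the abstract mentions.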
Problem

Research questions and friction points this paper is trying to address.

embryo development
natural language description
in vitro fertilization
multimodal AI
vision-language model
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language model
few-shot learning
embryo development
natural language generation
multimodal AI
Authors

Nicklas Neu
Software Competence Center Hagenberg GmbH, Softwarepark 32a, Hagenberg, 4232, Upper Austria, Austria
Thomas Ebner
Fraunhofer Heinrich Hertz Institute, HHI
Jasmin Primus
Wunschkind Klinik Dr. Brunbauer, Ebendorferstraße 6/4 Vienna, 1010, Vienna, Austria
Raphael Zefferer
Software Competence Center Hagenberg GmbH, Softwarepark 32a, Hagenberg, 4232, Upper Austria, Austria
Bernhard Schenkenfelder
Software Competence Center Hagenberg GmbH, Softwarepark 32a, Hagenberg, 4232, Upper Austria, Austria
Mathias Brunbauer
Wunschkind Klinik Dr. Brunbauer, Ebendorferstraße 6/4 Vienna, 1010, Vienna, Austria
Florian Kromp
Software Competence Center Hagenberg GmbH, Softwarepark 32a, Hagenberg, 4232, Upper Austria, Austria