🤖 AI Summary
Large language models (LLMs) exhibit limited performance on scientific workflow tasks—including configuration, annotation, translation, explanation, and generation—primarily due to insufficient domain knowledge. Method: This work presents the first systematic evaluation of over 20 open- and closed-source LLMs (e.g., Llama, GPT series) across mainstream workflow systems (e.g., Apache Airflow, Snakemake), employing customized prompts and a multidimensional evaluation protocol tailored to workflow semantics and execution constraints. Results: LLM accuracy on workflow tasks is substantially lower than on general NLP benchmarks; cross-system performance varies by over 40%, confirming that capabilities are highly sensitive to both task type and system architecture. The study identifies domain knowledge deficiency as the fundamental bottleneck and proposes transferable prompt optimization strategies and domain alignment techniques. It establishes the first empirical benchmark and methodological framework for leveraging LLMs in research automation.
📝 Abstract
With the advent of large language models (LLMs), there is growing interest in applying LLMs to scientific tasks. In this work, we conduct an experimental study to explore the applicability of LLMs for configuring, annotating, translating, explaining, and generating scientific workflows. We design five workflow-specific experiments and evaluate several open- and closed-source language models on state-of-the-art workflow systems. Our studies reveal that LLMs often struggle with workflow-related tasks due to their lack of knowledge of scientific workflows. We further observe that LLM performance varies across experiments and workflow systems. Our findings can help workflow developers and users understand the capabilities of LLMs for scientific workflows, and motivate further research on applying LLMs to workflows.
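To make the task setting concrete, the following is a minimal Snakemake workflow of the kind an LLM might be asked to generate, annotate, or explain in such experiments. This is a hypothetical illustration (file names, sample list, and shell commands are invented for this sketch, not drawn from the study's benchmark):

```snakemake
# Hypothetical two-stage workflow: count words per input file, then aggregate.
SAMPLES = ["a", "b"]

# Target rule: the final aggregated output the workflow should produce.
rule all:
    input:
        "results/summary.txt"

# Stage 1: count words in each sample file (wildcard {sample} is expanded
# per sample by Snakemake's dependency resolution).
rule count_words:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}.count"
    shell:
        "wc -w < {input} > {output}"

# Stage 2: concatenate all per-sample counts into one summary.
rule summarize:
    input:
        expand("results/{sample}.count", sample=SAMPLES)
    output:
        "results/summary.txt"
    shell:
        "cat {input} > {output}"
```

Even a small workflow like this encodes dependency structure, wildcard semantics, and execution constraints that a model must respect, which is why general NLP competence alone may not transfer to workflow tasks.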