LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

📅 2025-09-17

🤖 AI Summary
Scientific workflows increasingly run across the edge, cloud, and HPC continuum and generate large, structurally complex provenance data. Existing analysis approaches, based on custom scripts, structured queries, or static dashboards, limit interactive exploration of that data. The paper proposes an LLM agent system for workflow provenance analysis, comprising a modular reference architecture, a dedicated evaluation methodology, and an open-source implementation. The approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries, combined with prompt tuning and retrieval-augmented generation (RAG). Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, indicate that this design yields accurate and insightful agent responses that go beyond the recorded provenance.

📝 Abstract
Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.
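
The metadata-driven translation step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the record fields, the JSON query shape, and the names `summarize_schema`, `build_prompt`, `run_query`, `answer`, and `fake_llm` are all assumptions made for illustration, and the model call is replaced by a stub. The idea shown is that only a compact schema (not the raw records) goes into the prompt, and the model's structured query is validated against that schema before it runs.

```python
import json

# Hypothetical provenance records; field names are illustrative only.
RECORDS = [
    {"task_id": "t1", "activity": "run_dft", "status": "FINISHED", "duration_s": 42.0},
    {"task_id": "t2", "activity": "run_dft", "status": "FAILED", "duration_s": 3.5},
    {"task_id": "t3", "activity": "parse_output", "status": "FINISHED", "duration_s": 1.2},
]

# Comparison operators the sketch allows in a structured query.
OPS = {
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def summarize_schema(records):
    """Collect field names and value type names; this is the only metadata sent to the model."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, type(value).__name__)
    return schema


def build_prompt(question, schema):
    """Ask for a JSON query; the query format is an assumption of this sketch."""
    return (
        'Translate the question into JSON with keys "filters" '
        '(list of [field, op, value]) and "select" (list of fields).\n'
        f"Fields: {json.dumps(schema)}\n"
        f"Question: {question}\n"
    )


def run_query(query, records, schema):
    """Reject unknown fields or operators, then filter and project the records."""
    for field, op, _ in query["filters"]:
        if field not in schema or op not in OPS:
            raise ValueError(f"unsupported filter: {field} {op}")
    for field in query["select"]:
        if field not in schema:
            raise ValueError(f"unknown field: {field}")
    rows = [
        rec for rec in records
        if all(OPS[op](rec[field], value) for field, op, value in query["filters"])
    ]
    return [{field: rec[field] for field in query["select"]} for rec in rows]


def answer(question, records, llm):
    """Schema -> prompt -> model -> validated structured query -> result rows."""
    schema = summarize_schema(records)
    query = json.loads(llm(build_prompt(question, schema)))
    return run_query(query, records, schema)


def fake_llm(prompt):
    """Stand-in for a real model call; returns a canned structured query."""
    return json.dumps(
        {"filters": [["status", "==", "FAILED"]], "select": ["task_id", "duration_s"]}
    )
```

Calling `answer("Which tasks failed?", RECORDS, fake_llm)` returns the failed task's identifier and duration. Swapping `fake_llm` for a real client is the only change needed to try a different foundation model, which is one plausible reading of why a modular design helps when comparing models. The sketch assumes every record carries the same fields.
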
Problem

Research questions and friction points this paper is trying to address.

Analyzing complex workflow provenance data at scale
Overcoming limitations of static dashboards and custom scripts
Enabling natural language interaction with provenance systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents for interactive provenance analysis
Lightweight metadata-driven natural language translation
Modular design with prompt tuning and RAG
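
As a rough illustration of how RAG could feed the prompt in such a design, the sketch below retrieves the most relevant context snippet for a question and prepends it to the prompt. It is an assumption-laden toy: the snippets and the names `tokenize`, `cosine`, `retrieve`, and `augment_prompt` are hypothetical, and a bag-of-words cosine similarity stands in for the learned embeddings a real RAG component would more likely use.

```python
import math
import re
from collections import Counter

# Hypothetical context snippets describing provenance fields.
SNIPPETS = [
    "Field status takes the values FINISHED, FAILED, or RUNNING.",
    "Field duration_s is the task wall-clock time in seconds.",
    "Field activity names the workflow step that produced the task.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, snippets, k=1):
    """Return the k snippets most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        snippets,
        key=lambda s: cosine(q, Counter(tokenize(s))),
        reverse=True,
    )
    return ranked[:k]


def augment_prompt(question, snippets, k=1):
    """Prepend the retrieved context to the question."""
    context = "\n".join(retrieve(question, snippets, k))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

For the question "How long did the task take in seconds?", `retrieve` picks the `duration_s` snippet, so the model sees the field definition it needs without the whole snippet collection being placed in the prompt.
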