StarFlow: Generating Structured Workflow Outputs From Sketch Images

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the complexity of workflow construction in enterprise low-code platforms, and the challenge of automatically converting hand-drawn or computer-generated sketches into executable workflows, this paper proposes the first end-to-end sketch-to-structured-workflow generation framework based on vision-language models (VLMs). Methodologically, the authors curate a multi-source workflow graph dataset, comprising synthetic, human-annotated, and real-world examples, and combine VLM fine-tuning, multimodal graph understanding, structured decoding via JSON Schema, and style-robust training. The key contributions are: (1) the first automatic parsing of sketches into topologically complete, control-flow-accurate workflows; (2) a 32.7% improvement in structural accuracy over zero-shot large language models through VLM fine-tuning; and (3) cross-platform workflow export. Experiments demonstrate high-fidelity recovery of nodes and logical structure across diverse sketch inputs.
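The summary mentions structured decoding via JSON Schema and topological completeness of the generated graph. As a rough illustration only (the field names `nodes`, `edges`, `from`, and `to` are assumptions for this sketch, not the paper's actual output schema), a generated workflow could be represented and sanity-checked like this:

```python
# Hypothetical structured-workflow output in the spirit of StarFlow.
# The schema below is an illustrative assumption, not the paper's format.
workflow = {
    "nodes": [
        {"id": "start", "type": "trigger"},
        {"id": "check", "type": "condition"},
        {"id": "notify", "type": "action"},
    ],
    "edges": [
        {"from": "start", "to": "check"},
        {"from": "check", "to": "notify", "label": "yes"},
    ],
}

def is_topologically_complete(wf):
    """Return True if every edge endpoint references a declared node id."""
    ids = {n["id"] for n in wf["nodes"]}
    return all(e["from"] in ids and e["to"] in ids for e in wf["edges"])

print(is_topologically_complete(workflow))  # True
```

A consistency check of this kind is one way decoded output could be validated before export to a target platform; the paper's own pipeline may differ.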

📝 Abstract
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
Problem

Research questions and friction points this paper is trying to address.

Generating structured workflows from sketch images
Overcoming ambiguity in free-form workflow diagrams
Enhancing vision-language models for workflow automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating workflows from sketches using VLMs
Fine-tuning models for structured output generation
Diverse dataset for robust training and evaluation