FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes FlowSteer, an end-to-end reinforcement learning framework for interactive workflow orchestration that addresses key limitations of existing approaches, such as high manual effort, reliance on specific operators or large language models (LLMs), and sparse reward signals. FlowSteer features a lightweight policy model, an executable canvas environment, and a multi-turn state-action interaction mechanism. It introduces a novel canvas architecture supporting a pluggable operator library and swappable LLM backends, along with a tailored Canvas Workflow Relative Policy Optimization (CWRPO) algorithm to stabilize training and suppress shortcut behaviors. Extensive experiments across 12 datasets demonstrate that FlowSteer significantly outperforms current baselines, exhibiting strong generalization and task adaptability.
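The summary describes CWRPO as adding "diversity-constrained rewards with conditional release" to suppress shortcut behaviors. The paper does not give the formula here, but the idea can be sketched as: withhold task rewards from a rollout group until the group produces enough distinct workflows, then release group-relative advantages. All names and the `min_unique` threshold below are illustrative assumptions, not the paper's actual algorithm.

```python
def cwrpo_reward(workflows, base_rewards, min_unique=2):
    """Toy diversity-constrained reward with conditional release.

    Assumption: if the rollout group collapses onto too few distinct
    workflows (a shortcut behavior), all rewards are withheld; otherwise
    group-relative advantages are released, GRPO-style.
    """
    unique = len({tuple(w) for w in workflows})
    if unique < min_unique:
        # Condition not met: rewards withheld to discourage collapse.
        return [0.0] * len(base_rewards)
    mean = sum(base_rewards) / len(base_rewards)
    # Condition met: release rewards centered on the group mean.
    return [r - mean for r in base_rewards]
```

For example, two identical workflows in a group of two would yield all-zero rewards, while two distinct workflows would yield mean-centered advantages.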

📝 Abstract
In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that pairs a lightweight policy model, acting as the agent, with an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.
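The abstract's multi-turn state-action loop (policy analyzes execution state, selects an editing action, canvas executes and returns feedback) can be sketched as below. This is a minimal illustration under stated assumptions: `CanvasEnv`, `policy_step`, and the `add`/`remove`/`stop` action set are hypothetical stand-ins, not FlowSteer's actual operator library or action space.

```python
class CanvasEnv:
    """Toy executable canvas: a workflow is an ordered list of operator names."""

    def __init__(self):
        self.workflow = []

    def apply(self, action, operator=None):
        if action == "add":
            self.workflow.append(operator)
        elif action == "remove" and self.workflow:
            self.workflow.pop()
        # "Execute" the toy workflow and return its state as feedback.
        return {"workflow": list(self.workflow)}


def policy_step(feedback, target):
    """Stand-in for the lightweight policy model: picks the next edit action."""
    built = feedback["workflow"]
    if built == target:
        return ("stop", None)
    if built == target[: len(built)]:
        return ("add", target[len(built)])
    return ("remove", None)  # backtrack on a wrong prefix


def orchestrate(target, max_turns=10):
    """Multi-turn loop: policy edits, canvas executes, feedback refines."""
    env = CanvasEnv()
    feedback = {"workflow": []}
    for _ in range(max_turns):
        action, op = policy_step(feedback, target)
        if action == "stop":
            break
        feedback = env.apply(action, operator=op)
    return feedback["workflow"]
```

Calling `orchestrate(["retrieve", "rerank", "generate"])` builds the target workflow in three editing turns; a real policy model would choose actions from LLM analysis of execution feedback rather than comparing against a known target.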
Problem

Research questions and friction points this paper is trying to address.

workflow orchestration
manual cost
LLM dependency
sparse rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlowSteer
end-to-end reinforcement learning
workflow orchestration
CWRPO
agentic workflows