FlowMesh: A Service Fabric for Composable LLM Workflows

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
AI deployment is shifting from single-task LLM jobs to multi-tenant workflows that combine data transformation, fine-tuning, and agent interaction, which raises challenges of redundant computation across users, inefficient batching, and uncoordinated scheduling over heterogeneous GPU resources. This paper proposes a multi-tenant service architecture for composable LLM workflows: it decomposes workflows into fine-grained operators and combines data-lineage tracking with content-addressable storage to enable cross-user reuse of computation and stateless elastic scaling. A global control plane driven by a single utility function jointly optimizes throughput, cost, and data locality while supporting heterogeneous GPU platforms (e.g., Kubernetes, Vast.ai). Experiments show that the approach reduces cost by up to 3.8x and energy consumption by up to 2.0x versus baseline systems, achieves comparable or lower latency, and remains stable under dynamic workloads and failures.

📝 Abstract
AI deployment increasingly resembles a pipeline of data transformation, fine-tuning, and agent interactions rather than a monolithic LLM job; recent examples include RLHF/RLAIF training and agentic workflows. To cope with this shift, we propose FlowMesh, a multi-tenant service fabric that executes and optimizes these workloads as one shared service instead of isolated pipelines. It decomposes workflows into fine-grained operators with recorded lineage, enabling de-duplication of work across users and batching requests on the same hardware while preserving per-workflow provenance. A global control plane maintains a cluster-wide pool of ready operators and uses a single utility function to pick both the batch and the worker, balancing throughput, cost, and data locality on heterogeneous GPUs. The data plane is an elastic fleet of stateless workers backed by a content-addressable store, enabling rapid, automatic scale-out, safe retry after preemption, and portability across managed clusters such as Kubernetes and geo-distributed GPU marketplaces such as Vast.ai. Compared with baseline solutions, FlowMesh achieves up to 3.8x cost reduction and 2.0x lower energy usage, provides a similar or better latency profile, and remains efficient under dynamic and failure-prone conditions.
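The abstract describes the control plane picking both the batch and the worker with a single utility function that balances throughput, cost, and data locality. The sketch below illustrates that idea only; the class names, fields, and weight values (`w_tput`, `w_cost`, `w_locality`) are assumptions for illustration, not FlowMesh's actual implementation.

```python
# Illustrative sketch of utility-driven (batch, worker) selection.
# All names and weights here are assumed, not taken from FlowMesh.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cost_per_hour: float        # $/hour for this GPU worker
    throughput: float           # operator-batches/hour it can sustain
    cached_inputs: set = field(default_factory=set)  # content hashes already local

@dataclass
class Batch:
    op: str
    input_hashes: set           # content-addressed inputs the batch needs

def utility(batch, worker, w_tput=1.0, w_cost=0.5, w_locality=0.3):
    # Data locality: fraction of the batch's inputs already cached on the worker.
    locality = len(batch.input_hashes & worker.cached_inputs) / max(len(batch.input_hashes), 1)
    # One scalar score trading off throughput, cost, and locality.
    return w_tput * worker.throughput - w_cost * worker.cost_per_hour + w_locality * locality

def schedule(batches, workers):
    # Jointly pick the (batch, worker) pair with maximum utility.
    return max(((b, w) for b in batches for w in workers),
               key=lambda bw: utility(*bw))
```

At equal cost and throughput, the worker that already caches more of a batch's content-addressed inputs wins, which is how a single scalar score can express the locality preference the abstract mentions.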
Problem

Research questions and friction points this paper is trying to address.

Optimizing multi-tenant LLM workflows as shared services
Decomposing workflows into fine-grained operators with lineage
Balancing throughput, cost, and data locality on heterogeneous GPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-tenant service fabric for shared LLM workflows
Fine-grained operators with lineage for deduplication and batching
Global control plane optimizing heterogeneous GPU utilization
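The deduplication idea above rests on content addressing: if an operator's result is keyed by a hash of the operator plus its input content, identical work submitted by different tenants resolves to the same key and is computed once. A minimal sketch, assuming a toy in-memory store (the class and method names are hypothetical, not FlowMesh's API):

```python
# Toy content-addressable result store illustrating cross-user
# deduplication. Names are illustrative, not FlowMesh's actual API.
import hashlib
import json

class ContentAddressableStore:
    def __init__(self):
        self._store = {}

    @staticmethod
    def key(op_name, inputs):
        # Deterministic key: hash of (operator, canonicalized input content).
        blob = json.dumps({"op": op_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def run(self, op_name, inputs, fn):
        k = self.key(op_name, inputs)
        if k not in self._store:        # cache miss: compute once
            self._store[k] = fn(inputs)
        return self._store[k]           # cache hit: reused across tenants
```

Because the key depends only on content, a second tenant submitting the same operator over the same data gets the stored result without recomputation, and stateless workers can safely retry after preemption since recomputed results land under the same key.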
Junyi Shen
National University of Singapore
Noppanat Wadlom
National University of Singapore
Lingfeng Zhou
Shanghai Jiao Tong University
Dequan Wang
Shanghai Jiao Tong University
AI for Science (AI4Science)
Xu Miao
DataCanvas
Lei Fang
DataCanvas
Yao Lu
National University of Singapore