Teola: Towards End-to-End Optimization of LLM-based Applications

📅 2024-06-29
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing LLM application frameworks employ coarse-grained, module-level orchestration: they optimize within individual components but do not coordinate scheduling across LLM and non-LLM components, leading to high end-to-end latency. Method: The paper proposes fine-grained dataflow modeling that decomposes each task into unified primitive units and builds a joint scheduling graph spanning both LLM and non-LLM components. A dynamic orchestration engine then applies primitive-level parallelization and pipelining to co-optimize the whole workflow. Contribution/Results: This establishes the first primitive-level, end-to-end orchestration paradigm, exposing optimization opportunities beyond isolated per-module throughput gains. Across diverse state-of-the-art LLM applications, the approach achieves up to a 2.09× end-to-end speedup over existing systems, demonstrating both effectiveness and generality.

📝 Abstract
Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.
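To make the fine-grained scheme concrete, the sketch below models a hypothetical RAG-style query as a primitive-level dataflow graph and compares its critical-path latency against running the same primitives back-to-back, which is effectively what coarse-grained module orchestration does. All primitive names and latency numbers are illustrative assumptions, not figures from the paper.

```python
# Hypothetical primitives of a RAG-style query with illustrative latencies (ms).
lat = {
    "embed_query": 10,    # embed the user query
    "retrieve_docs": 40,  # vector search over the index
    "prefill_query": 30,  # LLM prefill of the query part of the prompt
    "prefill_docs": 25,   # LLM prefill of the retrieved documents
    "decode": 120,        # autoregressive generation
}

# Edges: primitive -> upstream primitives it must wait for.
deps = {
    "embed_query": set(),
    "retrieve_docs": {"embed_query"},
    "prefill_query": set(),               # can start before retrieval finishes
    "prefill_docs": {"retrieve_docs"},
    "decode": {"prefill_query", "prefill_docs"},
}

def finish_time(node, memo):
    """Earliest completion time of `node` assuming independent primitives overlap."""
    if node not in memo:
        ready = max((finish_time(d, memo) for d in deps[node]), default=0)
        memo[node] = ready + lat[node]
    return memo[node]

sequential = sum(lat.values())         # modules run strictly one after another
pipelined = finish_time("decode", {})  # primitive-level graph: overlap where possible
print(f"sequential: {sequential} ms, pipelined: {pipelined} ms")
```

Under these assumed numbers the graph form finishes in 195 ms versus 225 ms sequentially; the gap widens as more primitives from different modules can overlap, which is the design space the paper's representation exposes.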
Problem

Research questions and friction points this paper is trying to address.

Reduce end-to-end latency in LLM-based applications
Address coarse-grained orchestration limiting cross-module optimizations
Improve scheduling via fine-grained primitive-level dataflow graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained end-to-end orchestration for LLM apps
Primitive-level dataflow graph representation
Optimizes parallelization and pipelining across modules
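The parallelization idea above can be sketched as a small list scheduler: any primitive whose dependencies have finished is submitted immediately, so independent work from different modules runs concurrently. The graph, primitive names, and the `run_primitive` stub are hypothetical stand-ins for illustration, not the paper's actual engine.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical primitive graph for one query; edges map a primitive to the
# upstream primitives it depends on (names are illustrative).
deps = {
    "embed_query": set(),
    "retrieve_docs": {"embed_query"},
    "prefill_query": set(),
    "prefill_docs": {"retrieve_docs"},
    "decode": {"prefill_query", "prefill_docs"},
}

def run_primitive(name: str) -> str:
    time.sleep(0.01)  # stand-in for the real embedding/retrieval/LLM work
    return name

def schedule(deps):
    """Wave-by-wave list scheduling: run every ready primitive in parallel."""
    done, order = set(), []
    pending = dict(deps)
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [n for n, d in pending.items() if d <= done]
            futures = [pool.submit(run_primitive, n) for n in ready]
            order.extend(f.result() for f in futures)
            done.update(ready)
            for n in ready:
                del pending[n]
    return order

order = schedule(deps)
print(order)
```

A real engine would additionally pipeline within primitives (e.g., streaming retrieved chunks into prefill) rather than waiting wave by wave; this sketch only shows the dependency-driven parallelism that the primitive-level graph makes schedulable.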