🤖 AI Summary
To address severe performance bottlenecks in script-language-based LLM orchestration—caused by synchronous blocking during remote API or LLM invocations—this paper proposes EPIC, the first opportunistic parallel lambda calculus model tailored for LLM orchestration. EPIC formally extends the lambda calculus with semantics supporting asynchronous external calls (e.g., LLM inference, tool invocation), and introduces automated dynamic dependency analysis coupled with a streaming asynchronous scheduler to enable safe, early parallel execution without manual intervention while preserving semantic correctness. The authors prove that EPIC satisfies confluence and operational completeness. Evaluated on Tree-of-Thoughts reasoning and multi-tool coordination tasks, EPIC achieves up to 6.2× end-to-end latency reduction and up to 12.7× improvement in time-to-first-token, with runtime overhead of only 1.3%–18.5% over hand-optimized Rust implementations.
📝 Abstract
Scripting languages are widely used to compose external calls, such as foreign functions that perform expensive computations, remote APIs, and more recently, machine learning systems such as large language models (LLMs). The execution time of scripts is often dominated by waiting for these external calls, and large speedups can be achieved via parallelization and streaming. However, doing this manually is challenging, even for expert programmers. To address this, we propose a novel opportunistic evaluation strategy for scripting languages based on a core lambda calculus that automatically executes external calls in parallel, as early as possible. We prove that our approach is confluent, ensuring that it preserves the programmer's original intent, and that our approach eventually executes every external call. We implement this approach in a framework called EPIC, embedded in Python. We demonstrate its versatility and performance on several applications drawn from the LLM literature, including Tree-of-Thoughts and tool use. Our experiments show that opportunistic evaluation improves total running time (up to $6.2\times$) and latency (up to $12.7\times$) compared to several state-of-the-art baselines, while performing very close (between $1.3\%$ and $18.5\%$ running time overhead) to hand-tuned parallel Rust implementations.
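To make the idea of opportunistic evaluation concrete, here is a minimal Python sketch (not EPIC's actual API; all names here are illustrative) of its core mechanism: external calls are submitted to an executor as soon as their arguments are ready, returning futures instead of blocking, and results are forced only at a genuine data dependency. Two independent slow calls therefore overlap rather than run sequentially.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins; EPIC's real interface may differ.
executor = ThreadPoolExecutor()

def ext_call(delay, result):
    """Simulates a slow external call (e.g., an LLM or remote API)."""
    time.sleep(delay)
    return result

def opportunistic(fn, *args):
    # Submit immediately and return a future instead of blocking,
    # so independent calls begin executing as early as possible.
    return executor.submit(fn, *args)

start = time.time()
# Two independent "LLM calls" start in parallel the moment they appear.
a = opportunistic(ext_call, 0.2, 1)
b = opportunistic(ext_call, 0.2, 2)
# Blocking happens only at the data dependency: the sum needs both values.
total = a.result() + b.result()
elapsed = time.time() - start

print(total)    # 3
# elapsed is ~0.2s (overlapped), not ~0.4s (sequential)
```

A hand-written version would block at each `ext_call` in turn; the opportunistic strategy recovers the parallelism automatically while producing the same final value, which is what the confluence result guarantees in general.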