Efficient Function-as-a-Service for Large Language Models with TIDAL

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) deployed on Function-as-a-Service (FaaS) platforms suffer severe GPU cold-start latency due to their high GPU memory footprint, dynamic runtime initialization, and lazy loading of CUDA kernels, low-level execution behaviors that are opaque to the FaaS scheduler. Method: TIDAL is an adaptive function-template generation framework built on fine-grained awareness of GPU execution trajectories. It captures LLM runtime behavior, models critical initialization paths, and generates lightweight, model-specific function templates that enable proactive GPU resource pre-allocation and CUDA kernel warm-up, without modifying the model or runtime and with compatibility across mainstream FaaS platforms. Results: Experiments show TIDAL reduces cold-start latency by 1.79×–2.11× over state-of-the-art methods and cuts the 95th-percentile time-to-first-token by 76.0%, improving both response latency and GPU utilization for LLM serverless functions.

📝 Abstract
Large Language Model (LLM) applications have emerged as a prominent use case for Function-as-a-Service (FaaS) due to their high computational demands and sporadic invocation patterns. However, serving LLM functions within FaaS frameworks faces significant GPU-side cold-start latency. A fundamental approach involves leveraging a template with function state saved on GPUs to bypass the cold start for new invocations. Yet, this approach struggles with the high GPU footprint, dynamic initialization behaviors, and lazy GPU kernel loading inherent in LLM functions, primarily due to a lack of insight into the underlying execution details. In this paper, we introduce TIDAL, an optimized FaaS framework for LLM applications that achieves fast startups by tracing fine-grained execution paths. By utilizing the traced execution details, TIDAL generates adaptive function templates, effectively breaking startup barriers for LLM functions. Extensive evaluations demonstrate that TIDAL reduces cold-start latency by 1.79×–2.11× and improves the 95th-percentile time-to-first-token by 76.0%, surpassing state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Addresses GPU-side cold start in FaaS for LLM applications
Mitigates the high GPU memory footprint and dynamic initialization behavior of LLM functions
Improves startup latency and time-to-first-token for LLM functions
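The template idea behind these problems can be illustrated with a minimal sketch. The sketch below is not TIDAL's code: `FakeLLMRuntime` is a hypothetical stand-in whose constructor simulates expensive GPU initialization, and the warm-template cache shows why keeping initialized function state resident avoids the cold-start cost on repeat invocations.

```python
import time

# Hypothetical stand-in for an LLM runtime (illustrative, not from the paper):
# construction is expensive (CUDA context setup, kernel loading, weight load),
# while serving after initialization is cheap.
class FakeLLMRuntime:
    def __init__(self):
        time.sleep(0.05)  # simulate GPU-side initialization cost
        self.ready = True

    def first_token(self):
        return "ok"

_template = None  # warm template kept resident between invocations

def invoke(use_template=True):
    """One FaaS invocation: reuse the warm template if one exists."""
    global _template
    t0 = time.perf_counter()
    if use_template and _template is not None:
        rt = _template            # warm path: skip initialization entirely
    else:
        rt = FakeLLMRuntime()     # cold path: pay the full startup cost
        _template = rt
    tok = rt.first_token()
    return tok, time.perf_counter() - t0

cold_tok, cold_lat = invoke()  # first call initializes the runtime
warm_tok, warm_lat = invoke()  # second call reuses the resident template
# warm_lat is orders of magnitude smaller than cold_lat
```

The hard part, which TIDAL targets, is that real LLM functions initialize dynamically and lazily, so a naive static template either misses state or wastes GPU memory.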
Innovation

Methods, ideas, or system contributions that make the work stand out.

TIDAL optimizes FaaS for LLM applications
TIDAL traces fine-grained execution paths
TIDAL generates adaptive function templates
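As a rough illustration of trace-then-template generation, the stdlib sketch below (assumed structure, not TIDAL's implementation) records the ordered initialization steps one cold invocation executes, then replays that trajectory ahead of time as a pre-warming "template". Step names like `alloc_gpu_memory` are hypothetical placeholders.

```python
# Illustrative sketch of execution-trajectory tracing (not TIDAL's actual code).
TRACE = []

def traced(step_name):
    """Decorator that records each initialization step in invocation order."""
    def deco(fn):
        def wrapper(*args, **kwargs):
            TRACE.append(step_name)
            return fn(*args, **kwargs)
        return wrapper
    return deco

@traced("alloc_gpu_memory")
def alloc_gpu_memory():
    pass  # placeholder for GPU memory pre-allocation

@traced("load_weights")
def load_weights():
    pass  # placeholder for model weight loading

@traced("compile_kernels")
def compile_kernels():
    pass  # placeholder for lazy CUDA kernel loading / warm-up

def cold_invoke():
    """A cold LLM function invocation running its full initialization path."""
    alloc_gpu_memory()
    load_weights()
    compile_kernels()
    return "first-token"

# 1) Profile one cold invocation to capture its execution trajectory.
cold_invoke()
trajectory = list(TRACE)

# 2) The "template" replays the traced critical path before future invocations.
def prewarm(trajectory):
    steps = {"alloc_gpu_memory": alloc_gpu_memory,
             "load_weights": load_weights,
             "compile_kernels": compile_kernels}
    for name in trajectory:
        steps[name]()

prewarm(trajectory)
print(trajectory)  # ['alloc_gpu_memory', 'load_weights', 'compile_kernels']
```

The paper's contribution is doing this adaptively for real LLM runtimes, where the critical path is model-specific and only visible through fine-grained GPU execution tracing.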