🤖 AI Summary
This work addresses the inference latency and kernel launch overhead that limit large language models (LLMs) in short-sequence, interactive settings. The authors propose a hybrid runtime framework that combines just-in-time (JIT) compilation with CUDA Graph execution. During autoregressive decoding, the Transformer computation is partitioned into static components, which are replayed via CUDA Graphs, and dynamic components, which are handled by JIT-compiled kernels. Graphs are captured asynchronously and reused across decoding steps, so the runtime keeps launch overhead low without giving up runtime flexibility. Evaluated on LLaMA-2 7B on a single GPU with batch size 1 and prompt lengths from 10 to 500 tokens, the method reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency than TensorRT-LLM in this regime.
📝 Abstract
Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch overhead while preserving runtime flexibility during autoregressive decoding. The framework partitions transformer inference into static components executed via CUDA Graph replay and dynamic components handled through JIT-compiled kernels, enabling asynchronous graph capture and reuse across decoding steps.
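The static/dynamic partition described above can be illustrated with a toy model. The following is a minimal sketch in plain Python, not the authors' implementation: it uses no CUDA API, and the names `HybridRuntime`, `static_ops`, `dynamic_ops`, and `shape_key` are hypothetical. It assumes that replaying a captured graph costs one launch regardless of how many operations it contains, that each dynamic operation costs one launch, and that capture happens synchronously on the first step (the paper describes asynchronous capture, which this sketch does not model).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Op = Callable[[float], float]  # stand-in for a GPU kernel


@dataclass
class HybridRuntime:
    """Toy model: static ops are captured once and replayed; dynamic ops run per step."""
    static_ops: List[Op]
    dynamic_ops: List[Op]
    # Hypothetical cache: shape key -> captured sequence of static ops.
    graphs: Dict[int, Tuple[Op, ...]] = field(default_factory=dict)
    launches: int = 0  # simulated launch counter

    def step(self, x: float, shape_key: int) -> float:
        graph = self.graphs.get(shape_key)
        if graph is None:
            # First step for this shape: one launch per static op, then "capture".
            for op in self.static_ops:
                self.launches += 1
                x = op(x)
            self.graphs[shape_key] = tuple(self.static_ops)
        else:
            # Later steps: the whole captured sequence counts as a single launch.
            self.launches += 1
            for op in graph:
                x = op(x)
        # Dynamic part: always launched individually (assumed JIT-compiled kernels).
        for op in self.dynamic_ops:
            self.launches += 1
            x = op(x)
        return x


if __name__ == "__main__":
    rt = HybridRuntime(
        static_ops=[lambda v: v + 1, lambda v: v * 2, lambda v: v - 3],
        dynamic_ops=[lambda v: v / 2],
    )
    for _ in range(4):
        rt.step(1.0, shape_key=1)
    # Compare the simulated launch count with fully eager execution.
    print("hybrid launches:", rt.launches, "eager launches:", 4 * 4)
```

In this model the saving grows with the number of static operations folded into each graph and with the number of decoding steps that reuse it, which is the intuition behind reusing captured graphs across steps.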
We evaluate the proposed approach on LLaMA-2 7B using single-GPU, batch-size-one inference across prompt lengths from 10 to 500 tokens. Experimental results show that the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency than TensorRT-LLM in this regime. These results indicate that hybrid JIT and CUDA Graph execution can reduce both inference latency and latency variance for short-sequence LLM workloads, making it a practical optimization strategy for latency-sensitive AI applications.
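For readers less familiar with the two reported metrics, the sketch below shows one common way to compute a relative TTFT reduction and a P99 latency from per-request measurements. The abstract does not say how the authors compute P99, so the nearest-rank definition used here is an assumption, and the function names `ttft_reduction_pct` and `percentile_nearest_rank` are hypothetical. The numbers in the usage section are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def ttft_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Relative reduction in Time-to-First-Token, as a percentage of the baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile (assumed definition): smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds, for illustration only.
    latencies_ms = [12.0, 11.5, 13.2, 11.9, 40.0, 12.1, 12.4, 11.8, 12.0, 12.3]
    print("P99 (ms):", percentile_nearest_rank(latencies_ms, 99))
    print("TTFT reduction (%):", ttft_reduction_pct(baseline_ms=50.0, optimized_ms=20.0))
```

P99 is dominated by the slowest requests, so it reflects latency variance as well as typical latency, which is why it is reported alongside TTFT.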