Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

📅 2026-04-25

🤖 AI Summary

This work addresses the high latency and kernel launch overhead that hinder large language models (LLMs) in short-sequence, interactive inference. The authors propose a hybrid runtime framework that, for the first time, integrates just-in-time (JIT) compilation with CUDA Graph execution for LLM inference. During autoregressive decoding, the Transformer computation is partitioned into static components, which are replayed via CUDA Graphs, and dynamic components, which are handled by JIT-compiled kernels; graph capture runs asynchronously and captured graphs are reused across decoding steps. This design trades off low launch overhead against runtime flexibility. Evaluated on LLaMA-2 7B on a single GPU with batch size 1 and prompt lengths from 10 to 500 tokens, the method reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency than TensorRT-LLM in this regime.
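To make the two reported metrics concrete, here is a minimal sketch of how a relative TTFT reduction and a P99 latency could be computed from raw measurements. The function names and the sample values are hypothetical and are not taken from the paper; the P99 uses the nearest-rank definition, which is only one of several common percentile conventions.

```python
from typing import Sequence


def percent_reduction(baseline_ms: float, optimized_ms: float) -> float:
    """Relative latency reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


def p99(samples_ms: Sequence[float]) -> float:
    """99th-percentile latency using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n), computed in integer arithmetic to avoid float rounding
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical numbers for illustration only (not measurements from the paper):
# a baseline TTFT of 100 ms and an optimized TTFT of 34 ms.
ttft_gain = percent_reduction(100.0, 34.0)

# Hypothetical latency samples of 1..100 ms.
tail = p99([float(v) for v in range(1, 101)])
```

With these made-up inputs, `ttft_gain` corresponds to a 66% reduction, and `tail` is the 99th-smallest of the 100 samples. P99 matters for interactive workloads because it captures tail behavior that a mean would hide.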

📝 Abstract

Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch overhead while preserving runtime flexibility during autoregressive decoding. The framework partitions transformer inference into static components executed via CUDA Graph replay and dynamic components handled through JIT-compiled kernels, enabling asynchronous graph capture and reuse across decoding steps. We evaluate the proposed approach on LLaMA-2 7B using single-GPU, batch-size-one inference across prompt lengths from 10 to 500 tokens. Experimental results show that the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency compared with TensorRT-LLM in this regime. These results indicate that hybrid JIT-CUDA Graph execution can effectively reduce inference latency and variance for short-sequence LLM workloads, making it a practical optimization strategy for latency-sensitive AI applications.
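The static/dynamic split described above can be illustrated with a CPU-only toy model. This is a sketch under stated assumptions, not the authors' implementation: it calls no CUDA API, the class `HybridRuntime` and all of its methods are hypothetical names, "capture" is modeled as recording a list of Python callables, and asynchronous capture is not modeled at all. It only shows the bookkeeping: capture once per shape, replay the whole static sequence as a single launch, and run the dynamic part separately at each step.

```python
from typing import Callable, Dict, List, Tuple

Op = Callable[[float], float]
ShapeKey = Tuple[int, ...]


class HybridRuntime:
    """Toy model (hypothetical): counts launches for graph replay plus a dynamic kernel."""

    def __init__(self, static_ops: List[Op]) -> None:
        self.static_ops = static_ops
        # shape key -> recorded op sequence (stand-in for a captured graph)
        self.graph_cache: Dict[ShapeKey, List[Op]] = {}
        self.captures = 0
        self.launches = 0

    def _capture(self) -> List[Op]:
        # Stand-in for graph capture: record the static op sequence once.
        self.captures += 1
        return list(self.static_ops)

    def _replay(self, graph: List[Op], x: float) -> float:
        # The whole recorded sequence counts as a single launch.
        self.launches += 1
        for op in graph:
            x = op(x)
        return x

    def _run_dynamic(self, x: float, step: int) -> float:
        # Stand-in for a JIT-compiled kernel whose argument changes every step.
        self.launches += 1
        return x + step

    def decode_step(self, x: float, step: int, shape_key: ShapeKey) -> float:
        graph = self.graph_cache.get(shape_key)
        if graph is None:  # first time this shape is seen: capture and cache
            graph = self._capture()
            self.graph_cache[shape_key] = graph
        return self._run_dynamic(self._replay(graph, x), step)


def run_demo(steps: int = 4) -> Tuple[float, int, int, int]:
    static_ops: List[Op] = [lambda v: v * 2.0, lambda v: v + 1.0, lambda v: v * 0.5]
    rt = HybridRuntime(static_ops)
    x = 1.0
    for step in range(steps):
        x = rt.decode_step(x, step, shape_key=(1, 1))
    # Eager execution would launch every static op plus the dynamic op each step.
    eager_launches = steps * (len(static_ops) + 1)
    return x, rt.captures, rt.launches, eager_launches
```

In this toy, four decoding steps with three static ops trigger one capture and eight launches, against sixteen launches for op-by-op execution. A real Transformer layer issues many more kernels per step, which is why launch overhead can dominate in batch-size-one decoding.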

Problem

Research questions and friction points this paper is trying to address.

inference latency

kernel launch overhead

large language models

short-sequence inference

low-latency deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

JIT compilation

CUDA Graph

low-latency inference