About the job
We are seeking a systems researcher or engineer with deep expertise in large-scale distributed storage and caching infrastructure to design and maintain a high-performance KV cache layer for large language model (LLM) inference. This role focuses on improving latency, throughput, and cost-efficiency in transformer-based model serving by optimizing the reuse of attention key-value states and prompt embeddings.
Responsibilities
- Design and implement a distributed KV cache system to store and retrieve intermediate states (e.g., attention keys/values) for transformer-based LLMs across GPUs or nodes.
- Optimize access latency and eviction policies for cached long-context LLM inputs, token streams, and reused embeddings.
- Collaborate with inference and serving teams to integrate the cache with token streaming pipelines, batched decoding, and model parallelism.
- Develop cache consistency and synchronization protocols for multi-tenant, multi-request environments.
- Implement memory-aware sharding, eviction (e.g., windowed LRU, TTL), and replication strategies across GPUs or distributed memory backends.
- Monitor system performance and iterate on caching algorithms to reduce compute costs and response time for inference workloads.
- Evaluate and, where needed, extend open-source KV stores or build custom GPU-aware caching layers (e.g., CUDA, Triton, shared memory, RDMA).
Qualifications
Minimum
- PhD in Computer Science, Applied Mathematics, Electrical Engineering, or a related technical field.
- Strong understanding of transformer-based model internals and how KV caching affects autoregressive decoding.
- Experience with distributed systems, memory management, and low-latency serving (RPC frameworks such as gRPC, CUDA-aware networking).
- Familiarity with high-performance compute environments (NVIDIA GPUs, TensorRT, Triton Inference Server).
- Proficiency in C++, Rust, Go, or CUDA for systems-level development.
Preferred
- Prior experience building inference-serving systems for LLMs (e.g., vLLM, SGLang, FasterTransformer, DeepSpeed, Hugging Face Text Generation Inference).
- Experience with memory hierarchy optimization (HBM, NUMA, NVLink) and GPU-to-GPU communication (NCCL, GPUDirect RDMA (GDR), GPUDirect Storage (GDS), InfiniBand).
- Exposure to cache-aware scheduling, batching, and prefetching strategies in model serving.