Senior Research Engineer / Scientist - Storage for LLMs

About the job

We are seeking a systems researcher or engineer with deep expertise in large-scale distributed storage and caching infrastructure to design and maintain a high-performance KV cache layer for large language model (LLM) inference. This role focuses on improving latency, throughput, and cost-efficiency in transformer-based model serving by optimizing the reuse of attention key-value states and prompt embeddings. You’ll work on cutting-edge AI systems problems with real-world impact, alongside a world-class team. The role offers opportunities to publish, contribute to open source, and attend top conferences, along with competitive compensation, generous research resources, and an innovation-driven culture.

Responsibilities

- Design and implement a distributed KV cache system to store and retrieve intermediate states (e.g., attention keys/values) for transformer-based LLMs across GPUs or nodes.

- Optimize low-latency access and eviction policies for caching long-context LLM inputs, token streams, and reused embeddings.

- Collaborate with inference and serving teams to integrate the cache with token streaming pipelines, batched decoding, and model parallelism.

- Develop cache consistency and synchronization protocols for multi-tenant, multi-request environments.

- Implement memory-aware sharding, eviction (e.g., windowed LRU, TTL), and replication strategies across GPUs or distributed memory backends.

- Monitor system performance and iterate on caching algorithms to reduce compute costs and response time for inference workloads.

- Evaluate and, where needed, extend open-source KV stores or build custom GPU-aware caching layers (e.g., using CUDA, Triton, shared memory, or RDMA).

Qualifications

Minimum

- PhD in Computer Science, Applied Mathematics, Electrical Engineering, or a related technical field.

- Strong understanding of transformer-based model internals and how KV caching affects autoregressive decoding.

- Experience with distributed systems, memory management, and low-latency serving (RPC, gRPC, CUDA-aware networking).

- Familiarity with high-performance compute environments (NVIDIA GPUs, TensorRT, Triton Inference Server).

- Proficiency in a systems-level language such as C++, Rust, Go, or CUDA.

Preferred

- Prior experience building inference-serving systems for LLMs (e.g., vLLM, SGLang, FasterTransformer, DeepSpeed, Hugging Face Text Generation Inference).

- Experience with memory hierarchy optimization (HBM, NUMA, NVLink) and GPU-to-GPU communication (NCCL, GPUDirect RDMA (GDR), GPUDirect Storage (GDS), InfiniBand).

- Exposure to cache-aware scheduling, batching, and prefetching strategies in model serving.