Member of Technical Staff, Model Efficiency

About the job

Our team is a fast-growing group of researchers and engineers focused on building reliable ML systems and pushing the boundaries of LLM inference efficiency. We develop techniques that improve how models execute in production, driving lower latency, higher throughput, and consistent quality across diverse workloads.

Responsibilities

- Work across the inference stack to improve core performance metrics by diving deep into model execution, identifying bottlenecks, and developing innovative optimizations.

- Collaborate closely with modeling and systems teams to experiment, measure, and ship improvements that meaningfully accelerate inference.

- Build expertise in advanced performance techniques, including GPU/CUDA optimizations, kernel-level improvements, and model execution strategies for MoE and large-scale architectures.

Qualifications

Minimum

- 5+ years of experience writing high-performance, production-quality code

- Strong programming skills in C++ or Python (Rust/Go also welcome)

- Experience working with large language models and familiarity with the LLM inference ecosystem (e.g., vLLM, SGLang)

- Ability to diagnose and resolve performance bottlenecks across the model execution stack

Preferred

- GPU programming, CUDA, or low-level systems optimization

- Language modeling with transformers (MoE, speculative decoding, KV-cache optimizations)

- Scaling performance-critical distributed systems (e.g., computation, search, storage)