About the job
Our team is a fast-growing group of researchers and engineers focused on building reliable ML systems and pushing the boundaries of LLM inference efficiency. We develop techniques that improve how models execute in production, driving lower latency, higher throughput, and consistent quality across diverse workloads.
Responsibilities
You’ll work across the inference stack to improve core performance metrics by diving deep into model execution, identifying bottlenecks, and developing innovative optimizations. You’ll collaborate closely with modeling and systems teams to experiment, measure, and ship improvements that meaningfully accelerate inference. As the team evolves, you’ll have opportunities to build expertise in advanced performance techniques, including GPU/CUDA optimizations, kernel-level improvements, and model execution strategies for MoE and large-scale architectures.
Qualifications
Minimum
5+ years of experience writing high-performance, production-quality code
Strong programming skills in C++ or Python (Rust/Go also welcome)
Experience working with large language models and familiarity with the LLM inference ecosystem (e.g., vLLM, SGLang)
Ability to diagnose and resolve performance bottlenecks across the model execution stack
A strong bias for action: you ship fast, measure impact, and iterate
Preferred
GPU programming, CUDA, or low-level systems optimization
Language modeling with transformers (MoE, speculative decoding, KV-cache optimizations)
Scaling performance-critical distributed systems (e.g., computation, search, storage)