Member of Technical Staff, Model Efficiency

Cohere
Toronto, Montreal, San Francisco, New York, Paris, Seoul, London
2025-11-07
Remote

About the job

Our team is a fast-growing group of researchers and engineers focused on building reliable ML systems and pushing the boundaries of LLM inference efficiency. We develop techniques that improve how models execute in production, driving lower latency, higher throughput, and consistent quality across diverse workloads.

Responsibilities

You’ll work across the inference stack to improve core performance metrics by diving deep into model execution, identifying bottlenecks, and developing innovative optimizations. You’ll collaborate closely with modeling and systems teams to experiment, measure, and ship improvements that meaningfully accelerate inference. As the team evolves, you’ll have opportunities to build expertise in advanced performance techniques, including GPU/CUDA optimizations, kernel-level improvements, and model execution strategies for MoE and large-scale architectures.

Qualifications

Minimum

5+ years of experience writing high-performance, production-quality code

Strong programming skills in C++ or Python (Rust/Go also welcome)

Experience working with large language models and familiarity with the LLM inference ecosystem (e.g., vLLM, SGLang)

Ability to diagnose and resolve performance bottlenecks across the model execution stack

A strong bias for action: you ship fast, measure impact, and iterate

Preferred

GPU programming, CUDA, or low-level systems optimization

Language modeling with transformers (MoE, speculative decoding, KV-cache optimizations)

Scaling performance-critical distributed systems (e.g., computation, search, storage)