Engineering Manager, Model Serving

About the job

Together AI is building the AI Inference & Model Shaping Platform that brings the most advanced generative AI models to the world. Our platform powers multi-tenant serverless workloads and dedicated endpoints, enabling developers, enterprises, and researchers to harness the latest LLMs and multimodal, image, audio, video, and reasoning models at scale. We are looking for an exceptional Engineering Lead to partner closely with our cross-functional engineering, infrastructure, research, and sales teams to ensure the excellence of our ML API offerings. Your primary focus will be delivering world-class inference and fine-tuning in our public APIs and customer deployments by building automation and operations processes.

Responsibilities

Own availability and performance SLAs for production inference and fine-tuning services across serverless and dedicated deployments

Own and improve testing, deployment, configuration management, and monitoring practices for multi-cluster ML infrastructure, partnering closely with Infra SREs

Build tooling and automation to reduce operational toil and enable self-serve offerings

Define and enforce configuration best practices for inference engines (SGLang, TRT-LLM, vLLM, etc.) to prevent runtime issues

Lead incident response, conduct postmortems, and drive reliability improvements

Mentor team members, with the potential to take on hiring and team building as the organization scales

Partner with infrastructure and ML engineering teams to improve system reliability and cost efficiency

Qualifications

Minimum

5+ years operating production ML inference or training systems at scale

2+ years in senior IC or tech lead roles, with demonstrated mentorship and technical leadership experience. Having built or scaled teams is a plus.

Deep expertise with Kubernetes, multi-cluster orchestration, and ML serving frameworks

Experience with multi-tenant SaaS platforms

Proven track record of SLA ownership with specific metrics (e.g., 99.9% uptime, p99 latency targets)

Customer escalation and incident communication experience

Experience with LLM inference serving systems (SGLang, vLLM, TRT-LLM, or similar)

Ability to influence cross-functional teams and make deployment/architecture decisions

Preferred

Experience building internal developer platforms or self-serve tooling

Background in cost optimization for GPU infrastructure

Contributions to open-source ML infrastructure projects