Post-Training Platform Infrastructure Engineer

About the job

We are looking for a systems-minded engineer who works at the intersection of large-scale model inference, distributed systems, and performance optimization. This role focuses on post-training and inference infrastructure, with particular emphasis on prefill/decode (P/D) disaggregation, KV cache lifecycle management, and efficient offloading mechanisms across both inference and reinforcement learning (RL) systems.

Responsibilities

Research and deeply understand modern LLM inference frameworks, including: architecture and design tradeoffs of P/D (prefill/decode) disaggregation; KV cache lifecycle, memory layout, eviction strategies, and reuse; KV cache offloading mechanisms across GPU, CPU, and storage backends.
Analyze and compare inference execution paths to identify: performance bottlenecks (latency, throughput, memory pressure); inefficiencies in scheduling, cache management, and resource utilization.
Develop and implement infrastructure-level features to: improve inference latency, throughput, and memory efficiency; optimize KV cache management and offloading strategies; enhance scalability across multi-GPU and multi-node deployments.
Apply the same research-driven approach to RL frameworks: study post-training and RL systems (e.g., policy rollout, inference-heavy loops); debug performance and correctness issues in distributed RL pipelines; optimize inference, rollout efficiency, and memory usage during training.
Collaborate with research and applied ML teams to: translate model-level requirements into infrastructure capabilities; validate performance gains with benchmarks and real workloads.
Document findings, architectural insights, and best practices to guide future system design.

Qualifications

Minimum

No minimum qualifications listed.

Preferred

Strong background in systems engineering, distributed systems, or ML infrastructure.
Hands-on experience with GPU-accelerated workloads and memory-constrained systems.
Solid understanding of: LLM inference workflows (prefill vs. decode); attention mechanisms and KV cache behavior; multi-process / multi-GPU execution models.
Proficiency in Python and C++ (or similar systems languages).
Experience debugging performance issues using profiling tools (GPU, CPU, memory).
Ability to read, understand, and modify complex open-source codebases.
Strong analytical skills and comfort working in research-heavy, ambiguous problem spaces.
Direct experience with LLM inference frameworks or serving stacks.
Familiarity with: GPU memory hierarchies (HBM, pinned memory, NUMA considerations); KV cache compression, paging, or eviction strategies; storage-backed offloading (NVMe, object stores, distributed file systems).
Experience with distributed RL or post-training pipelines.
Knowledge of scheduling systems, async execution, or actor-based runtimes.
Contributions to open-source ML or systems projects.
Experience designing benchmarking suites or performance evaluation frameworks.