Member of Technical Staff - Edge Inference Engineer

About the job

Our Edge Inference team compiles Liquid Foundation Models into optimized machine code that runs on resource-constrained devices: phones, laptops, Raspberry Pis, and watches. We are core contributors to llama.cpp and build the infrastructure that makes efficient on-device AI possible. You will work directly with the technical lead on problems that require deep understanding of both ML architectures and hardware constraints. This is high-ownership work where your code ships to production and directly impacts model performance on real devices.

Responsibilities

Implement and optimize inference kernels for CPU, NPU, and GPU architectures across diverse edge hardware

Develop quantization strategies (INT4, INT8, FP8) that maximize compression while preserving model quality under strict memory budgets

Contribute to llama.cpp and other open-source inference frameworks, including support for new model architectures (audio, vision)

Profile and optimize end-to-end inference pipelines to achieve sub-100 ms time-to-first-token on target devices

Collaborate with ML researchers to understand model architectures and identify optimization opportunities specific to Liquid Foundation Models

Qualifications

Minimum

5+ years of experience in systems programming with strong C++ proficiency

Experience in embedded software engineering or with other resource-constrained systems

Understanding of ML fundamentals at the linear algebra level (how matrix operations, attention, and quantization work)

Experience with hardware architecture concepts: cache hierarchies, memory bandwidth, SIMD/vectorization

Preferred

Contributions to llama.cpp, ExecuTorch, or similar inference frameworks

Experience with Rust for systems programming

Background in custom accelerator development (TPU, NPU), or experience at companies like SambaNova, Cerebras, or Groq, or on accelerator teams at Google or Amazon

Quantitative degree (mathematics, physics, or similar) combined with engineering experience