Software Development Engineer, EC2 Instance Networking

About the job

Join our team building the scale-out networking backbone that powers the world's largest AI training clusters. We're developing high-performance RDMA and RoCE solutions that enable distributed training of trillion-parameter models across thousands of compute nodes on AWS infrastructure.

Responsibilities

Design and develop high-performance networking software solutions utilizing RDMA and RoCE technologies for large-scale AI clusters

Integrate SmartNIC acceleration hardware with EC2 control plane systems and APIs

Implement and optimize collective communication patterns for distributed AI training workloads

Develop comprehensive performance monitoring, metrics collection, and benchmarking tools for high-bandwidth cluster interconnects

Create automated testing frameworks and stress testing tools for multi-rack distributed systems

Debug complex system-level issues across hardware acceleration, kernel networking, and distributed applications

Collaborate on architecture decisions for next-generation scale-out AI infrastructure

Participate in design reviews and code reviews, and contribute to technical documentation

Qualifications

Minimum

3+ years of non-internship professional software development experience

2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems

Strong programming skills in C/C++ with a focus on high-performance systems

Experience with RDMA technologies and RoCE implementations

Familiarity with collective communication libraries (NCCL, RCCL, OneCCL, MPI)

Experience with Linux networking, kernel development, and distributed systems

Understanding of high-performance computing clusters and parallel programming

Preferred

3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations

Bachelor's degree in computer science or equivalent

Experience with SmartNIC programming and network acceleration hardware APIs

Knowledge of large-scale AI training infrastructure and multi-rack cluster networking

Experience with performance optimization, benchmarking, and system-level debugging

Understanding of AI accelerator architectures and scale-out communication patterns

Experience with cloud infrastructure integration and virtualization technologies

Bachelor's degree in Computer Science, Computer Engineering, or related field

Strong problem-solving skills and experience with complex distributed systems

Proficiency in design and analysis of algorithms and data structures

Linux operating system knowledge

In-depth knowledge of TCP/IP

Kernel or embedded development experience, particularly with the Linux kernel

Strong knowledge of Computer Science fundamentals in data structures, algorithm design, problem solving, and complexity analysis

Knowledge of at least one modern programming language such as C, C++, Rust, Python, or Perl

Experience developing complex software systems that have been successfully delivered to customers

Knowledge of professional software engineering best practices for the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations

Ability to take a project from scoping requirements through to launch

Experience communicating with users, other technical teams, and management to collect requirements and describe software product features and technical designs

Experience mentoring junior software development engineers and driving engineering excellence