About the job
Join our team building the scale-out networking backbone that powers the world's largest AI training clusters. We're developing high-performance RDMA and RoCE solutions that enable distributed training of trillion-parameter models across thousands of compute nodes on AWS infrastructure.
Responsibilities
Design and develop high-performance networking software solutions utilizing RDMA and RoCE technologies for large-scale AI clusters
Integrate SmartNIC acceleration hardware with EC2 control plane systems and APIs
Implement and optimize collective communication patterns for distributed AI training workloads
Develop comprehensive performance monitoring, metrics collection, and benchmarking tools for high-bandwidth cluster interconnects
Create automated testing frameworks and stress testing tools for multi-rack distributed systems
Debug complex system-level issues across hardware acceleration, kernel networking, and distributed applications
Collaborate on architecture decisions for next-generation scale-out AI infrastructure
Participate in design reviews and code reviews, and contribute to technical documentation
Qualifications
Minimum
3+ years of non-internship professional software development experience
2+ years of non-internship experience in design or architecture (design patterns, reliability, and scaling) of new and existing systems
Strong programming skills in C/C++ with focus on high-performance systems
Experience with RDMA technologies and RoCE implementations
Familiarity with collective communication libraries (NCCL, RCCL, oneCCL, MPI)
Experience with Linux networking, kernel development, and distributed systems
Understanding of high-performance computing clusters and parallel programming
Preferred
3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Bachelor's degree in computer science or equivalent
Experience with SmartNIC programming and network acceleration hardware APIs
Knowledge of large-scale AI training infrastructure and multi-rack cluster networking
Experience with performance optimization, benchmarking, and system-level debugging
Understanding of AI accelerator architectures and scale-out communication patterns
Experience with cloud infrastructure integration and virtualization technologies
Bachelor's degree in Computer Science, Computer Engineering, or related field
Strong problem-solving skills and experience with complex distributed systems
Proficiency in design and analysis of algorithms and data structures
Linux operating system knowledge
In-depth knowledge of TCP/IP
Experience with kernel or embedded development, particularly the Linux kernel
Strong knowledge of Computer Science fundamentals in data structures, algorithm design, problem solving, and complexity analysis
Knowledge of at least one modern programming language such as C, C++, Rust, Python, or Perl
Experience developing complex software systems that have been successfully delivered to customers
Knowledge of professional software engineering best practices for the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Ability to take a project from scoping requirements through launch
Experience communicating with users, other technical teams, and management to collect requirements and to describe software product features and technical designs
Experience mentoring junior software development engineers and driving engineering excellence