Senior AI Performance Architect

Qualcomm
Raleigh, North Carolina, United States of America / Santa Clara, California, United States of America / San Diego, California, United States of America

2026-01-18

Onsite

About the job

We are looking for AI Accelerator Architecture Engineers to drive functional, performance, and power enhancements into the HW to enable state-of-the-art training capabilities. AI inference and training systems must scale to a large number of accelerators, servers, and racks. Our devices must be designed to scale to handle the largest of today's models. The AI Architecture team comprises experts who span the full gamut from software architecture, algorithm development, and kernel optimization down to hardware accelerator block architecture and SoC design. The ideal candidate will augment the team by contributing to one or more of these areas.

Responsibilities

Understand trends in ML network design through customer engagements and the latest academic research, and determine how these trends will affect both SW and HW design

Work with customers to determine hardware requirements for AI training systems

Analyze current accelerator and GPU architectures

Architect enhancements required for efficient training of AI models

Design and architecture of:

Flexible Computational Blocks

Involving a variety of datatypes: floating point, fixed point, microscaling

Involving a variety of precisions: 32/16/8/4/2/1-bit

Capable of optimally performing dense and sparse GEMM, GEMV

Memory Technology and subsystems optimized for a range of requirements

Capacity

Bandwidth

Compute in Memory, Compute near Memory

Scale-Out and Scale-Up Architectures

Switches, NoCs, Codesign with Communication Collectives

Optimized for Power

Ability to perform Competitive Analysis

Codesign HW with SW/GenAI (LLM) requirements

Define performance models to prove effectiveness of architecture proposals

Pre-Silicon prediction of performance for various ML training workloads

Analyze performance/area/power trade-offs for future HW and SW ML algorithms, including the impact of SoC components (memory and bus)

Qualifications

Minimum

Bachelor's degree in Computer Science, Engineering, Information Systems, or related field and 2+ years of Hardware Engineering, Software Engineering, Systems Engineering, or related work experience.

OR

Master's degree in Computer Science, Engineering, Information Systems, or related field and 1+ year of Hardware Engineering, Software Engineering, Systems Engineering, or related work experience.

OR

PhD in Computer Science, Engineering, Information Systems, or related field.

Preferred

Knowledge of computer architecture, digital circuits and hardware simulators

Knowledge of communication protocols used in AI systems

Knowledge of Network-on-Chip (NoC) designs used in System-on-Chip (SoC) designs

Understanding of various memory technologies used in AI systems

Experience in modeling hardware and workloads in order to extract performance and power estimates

High-level hardware modeling experience preferred

Knowledge of AI Training systems such as NVIDIA DGX and NVL72

Experience training and fine-tuning LLMs using distributed training frameworks such as DeepSpeed and FSDP

Knowledge of front-end ML frameworks (e.g., TensorFlow, PyTorch) used for training ML models

Strong communication skills (written and verbal)

Detail-oriented with strong problem-solving, analytical and debugging skills

Demonstrated ability to learn, think and adapt in a fast-changing environment

Ability to code in C++ and Python

Knowledge of a variety of classes of ML models (e.g., CNN, RNN)