🤖 AI Summary
This work investigates how the robustness of image recognition models degrades when they are deployed on hardware accelerators and their target device code contains faults, introduced either by developers or by problems in the compilation process. The authors propose MutateNN, a tool that combines mutation testing with differential testing across devices. It applies 21 mutations in six categories, including conditional operators, layer modification, arithmetic types, and input, to seven well-established DNN models and deploys the resulting mutants on four hardware acceleration devices of varying capabilities. Experiments show that mutations of conditional operators produce discrepancies of up to 90.3% across devices, while mutations related to layer modification, arithmetic types, and input severely affect overall model performance (by up to 99.8%) or crash the model, consistently across devices. The work offers a methodology and empirical evidence for validating the reliability of DNN models compiled for hardware accelerators in safety-critical settings.
📝 Abstract
As the use of Artificial Intelligence (AI) for resource-intensive and safety-critical tasks increases, a variety of Machine Learning (ML) compilers have been developed to make Deep Neural Networks (DNNs) compatible with a range of hardware acceleration devices. However, because DNNs are widely used for challenging and demanding tasks, the behavior of these compilers must be verified. To this end, we propose MutateNN, a tool that combines elements of differential and mutation testing to examine the robustness of image recognition models deployed on hardware accelerators with different capabilities, in the presence of faults in their target device code, introduced either by developers or by problems in the compilation process. We focus on the image recognition domain, applying mutation testing to 7 well-established DNN models and introducing 21 mutations in 6 different categories. We deployed our mutants on 4 hardware acceleration devices of varying capabilities and observed that DNN models presented discrepancies of up to 90.3% across devices for mutants related to conditional operators. We also observed that mutations related to layer modification, arithmetic types, and input severely affected overall model performance (by up to 99.8%) or led to model crashes, consistently across devices.
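The combination of mutation and differential testing described above can be illustrated with a minimal, self-contained sketch. This is not MutateNN's implementation. The toy classifier, its weights, the flipped-comparison mutant, the "device" names, and the disagreement-rate metric are all assumptions made for illustration. The sketch mutates a conditional operator inside an activation function, runs the original and the mutant on the same inputs as if they were two device builds, and reports the percentage of inputs on which their top-1 predictions differ.

```python
from typing import Callable, Dict, List, Sequence

Activation = Callable[[float], float]

# Hypothetical toy weights: one row of weights per output class.
WEIGHTS: List[List[float]] = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]


def relu(x: float) -> float:
    """Original activation; its comparison is the mutation target."""
    return x if x > 0 else 0.0


def relu_mutant(x: float) -> float:
    """Illustrative conditional-operator mutant: `>` flipped to `<`."""
    return x if x < 0 else 0.0


def predict(activation: Activation, inputs: Sequence[float]) -> int:
    """Return the top-1 class index of the toy classifier (first index on ties)."""
    scores = [
        activation(sum(w * x for w, x in zip(row, inputs))) for row in WEIGHTS
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def discrepancy(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Percentage of inputs on which two prediction lists disagree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        return 0.0
    differing = sum(a != b for a, b in zip(preds_a, preds_b))
    return 100.0 * differing / len(preds_a)


def compare_builds(
    builds: Dict[str, Activation],
    dataset: Sequence[Sequence[float]],
    reference: str,
) -> Dict[str, float]:
    """Differential test: compare every build against the reference build."""
    ref_preds = [predict(builds[reference], x) for x in dataset]
    return {
        name: discrepancy(ref_preds, [predict(act, x) for x in dataset])
        for name, act in builds.items()
        if name != reference
    }


# Hypothetical inputs and "device builds"; a real setup would compile the
# model for each accelerator and run actual images through it.
DATASET: List[List[float]] = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
BUILDS: Dict[str, Activation] = {
    "device_a_original": relu,
    "device_b_mutant": relu_mutant,
}

if __name__ == "__main__":
    print(compare_builds(BUILDS, DATASET, reference="device_a_original"))
```

Using a disagreement rate over top-1 predictions keeps the comparison independent of small numeric differences between devices. A stricter variant could compare full output rankings or raw scores within a tolerance.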