CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the critical challenge of GPU cross-architecture code portability. We present the first unified translation stack supporting bidirectional source-level CUDA↔HIP and assembly-level NVIDIA SASS↔AMD RDNA3 translation. Our approach introduces a multi-granularity, execution-verified dataset (70K code pairs), a domain-adapted Transformer-based language model family, a dual-granularity (source–assembly) alignment method, and an execution-driven automatic verification framework. We release CASS-Bench, a 16-domain benchmark suite, together with an open-source toolchain. Experiments show 95% accuracy for source-code translation and 37.5% accuracy for assembly translation. Generated code matches native performance in over 85% of test cases, significantly outperforming baselines including GPT-4o, Claude, and Hipify.
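To make the source-level task concrete, the sketch below mimics the kind of textual CUDA-to-HIP rewriting that AMD's hipify-perl tool performs (and that the CASS models learn from verified code pairs, alongside the much harder assembly-level SASS→RDNA3 direction). The mapping table here is a small illustrative subset chosen for this example, not the paper's dataset or the full hipify mapping.

```python
import re

# Illustrative subset of the CUDA -> HIP API mapping; hipify-perl
# applies a far larger table of such substitutions.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def hipify(cuda_source: str) -> str:
    """Rewrite CUDA API identifiers to their HIP equivalents.

    Keys are matched longest-first so that e.g. cudaMemcpyHostToDevice
    is not partially rewritten by the shorter cudaMemcpy entry.
    """
    keys = sorted(CUDA_TO_HIP, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], cuda_source)

cuda_snippet = (
    "#include <cuda_runtime.h>\n"
    "cudaMalloc(&d_buf, n);\n"
    "cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);\n"
    "cudaFree(d_buf);\n"
)
print(hipify(cuda_snippet))
```

Purely textual substitution like this is exactly where tools such as Hipify fall short on real code (macros, inline PTX, architecture-specific intrinsics), which is the gap the learned CASS models target.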

📝 Abstract
We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on HuggingFace (https://huggingface.co/datasets/MBZUAI/cass), with code on GitHub (https://github.com/GustavoStahl/CASS).
Problem

Research questions and friction points this paper is trying to address.

Cross-architecture GPU code transpilation between Nvidia and AMD
Lack of large-scale dataset for low-level GPU code portability
Performance and accuracy gaps in existing translation tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for GPU code transpilation
Domain-specific models achieve high accuracy
Open-source benchmark for rigorous evaluation