Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing ML performance analysis tools lack fine-grained, machine-code-level visibility into accelerator behavior and fail to deliver actionable optimization guidance. This work introduces xPU-Shark, the first framework to bring microarchitectural simulators, traditionally used only during hardware design, into production performance analysis. By replaying instruction-level execution traces captured from real deployments, xPU-Shark enables precise microarchitectural performance diagnosis and targeted optimization. Its methodology integrates hardware-assisted instruction tracing, a customizable cycle-accurate simulator, production trace collection with deterministic replay, and microarchitectural modeling of communication primitives. Experimental evaluation uncovers previously unknown microarchitectural bottlenecks; optimizations guided by xPU-Shark accelerate collective communication by up to 15% and reduce LLM token generation latency by up to 4.1%. These results surpass the capabilities of conventional coarse-grained profiling tools, demonstrating the efficacy of production-deployable microarchitectural simulation for AI accelerator optimization.

📝 Abstract
As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse-grained and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization suggestions. Our core insight is to use a hardware-level simulator, an artifact of the hardware design process that we can repurpose for performance analysis. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator to gain low-level insights into the model's performance. We implemented xPU-Shark for our in-house accelerator and used it to analyze the performance of several of our production LLMs, revealing several previously unknown microarchitecture inefficiencies. Leveraging these insights, we optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
Problem

Research questions and friction points this paper is trying to address.

Analyze ML accelerator performance at machine-code level
Provide actionable optimization suggestions for ML models
Identify and reduce microarchitecture inefficiencies in accelerators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained ML model analysis at machine-code level
Hardware-level simulator repurposed for performance insights
Actionable optimization suggestions from production deployment traces
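The capture-and-replay idea above can be sketched in miniature: record an instruction trace in production, then replay it through a latency model to attribute cycles to operations and surface bottlenecks. All names here (`TraceEvent`, `replay`, the per-opcode latency table) are illustrative assumptions, not xPU-Shark's actual API, and a real cycle-accurate simulator models far more than a static latency per opcode.

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    pc: int       # program counter of the traced instruction
    opcode: str   # mnemonic, e.g. "matmul" or "allreduce"

# Hypothetical per-opcode cycle costs standing in for a
# cycle-accurate microarchitecture model.
LATENCY = {"load": 2, "matmul": 4, "allreduce": 10, "add": 1}

def replay(trace):
    """Replay a captured trace and attribute total cycles per opcode."""
    cycles_by_opcode = {}
    total = 0
    for ev in trace:
        cost = LATENCY.get(ev.opcode, 1)
        cycles_by_opcode[ev.opcode] = cycles_by_opcode.get(ev.opcode, 0) + cost
        total += cost
    # Rank opcodes by cycle share so the biggest bottleneck comes first.
    ranked = sorted(cycles_by_opcode.items(), key=lambda kv: -kv[1])
    return total, ranked

trace = [TraceEvent(0x0, "load"), TraceEvent(0x4, "matmul"),
         TraceEvent(0x8, "allreduce"), TraceEvent(0xC, "add")]
total, ranked = replay(trace)
# allreduce dominates this toy trace: 10 of 17 cycles
```

In this toy replay the communication collective accounts for most cycles, mirroring the paper's finding that communication primitives were a key optimization target.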