BRYT: Data Rich Analytics Based Computer Architecture for A New Paradigm of Chip Design to Supplant Moore's Law

📅 2023-12-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the on-chip analysis bottleneck in the post-Moore era, this paper introduces a data-rich architectural paradigm to resolve three core challenges in hardware introspection: the difficulty of in-situ A/B testing, ambiguity between hardware and software behavior, and the lack of fine-grained observability. The authors implement the first version of this paradigm as a lightweight embedded analYtics Processing Unit (YPU) that integrates RTL-level telemetry circuits, a streaming analytics engine, and workload-aware compression/triggering mechanisms, with <1% area overhead and <25 mW power consumption in 7 nm technology. The key contributions are instruction-level cycle-stack tracing with no slowdown, in-field prefetcher evaluation, module-level cycle-utilization heatmaps, and real-time histograms of AI tensor-value distributions. Evaluated across four representative case studies under realistic workloads, the YPU is shown to be effective and practical for production-grade chip analysis.
📝 Abstract
Motivated by the end of Moore's Law and Dennard Scaling, which necessitate architectural efficiency as the means for improved capability for the next decade or two, this paper introduces a new data-rich paradigm of chip design for the semiconductor industry. The goal is to enable monitoring of chip hardware behavior in the field, at real-time speeds with no slowdowns and minimal power overheads, and to obtain insights on chip behavior and workloads. We posit that such extensive amounts of data would allow better and more capable architectures by addressing three problems: obfuscated hardware, obfuscated software, and the inability to A/B test hardware ideas. This paper implements the first version of the paradigm with a system architecture and the concept of an analYtics Processing Unit (YPU). We perform 4 case studies and implement an RTL-level prototype. Across the case studies we show that a YPU with area overhead of $<1\%$ at 7 nm and overall power consumption of $<25$ mW is able to create previously inconceivable analyses: per-instruction cycle stacks of arbitrary programs, evaluation of instruction prefetchers in the wild before deployment, fine-grained cycle-by-cycle utilization of hardware modules, and histograms of tensor-value distributions of DL models.
Problem

Research questions and friction points this paper is trying to address.

Monitoring chip hardware behavior during real workloads
Enabling hardware introspection without performance slowdowns
Addressing obfuscated hardware, obfuscated software, and the inability to A/B test hardware ideas
Innovation

Methods, ideas, or system contributions that make the work stand out.

YPU enables real-time hardware introspection without slowdowns
YPU achieves less than 1% area overhead at 7 nm
YPU consumes under 25 mW of power for hardware analysis
Ian McDougall
University of Wisconsin-Madison
Shayne Wadle
University of Wisconsin-Madison
Harish Batchu
University of Wisconsin-Madison
Michael Davies
NVIDIA
Hardware Architecture, Computational Science, Operating Systems, High-Performance Computing
Karthikeyan Sankaralingam
Professor of Computer Science, University of Wisconsin-Madison
Computer Architecture