🤖 AI Summary
Empirical claims in ECE/CS research are often weakened by evaluations that lack systematic, credible statistical analysis. To address this, the tutorial proposes a structured evaluation workflow for beginning researchers that combines classical methods, such as t-tests and ANOVA, with distribution-free techniques, including bootstrap resampling, Wilcoxon and Mann-Whitney tests, and Cliff's delta. The workflow runs from formulating a research claim to reporting results, and covers factorial design, multiple-comparison correction, and simulation verification and validation. It is accompanied by Python snippets that readers can paste and adapt, a running example, and a pre-submission checklist. The aim is to make experimental findings more reliable and reproducible while serving as both a teaching resource and practical guidance for researchers.
📝 Abstract
Strong experimental papers in electrical and computer engineering and computer science (ECE/CS), especially in systems, networking, and applied machine learning, rest on more than a single impressive number. They rest on a chain of design, measurement, analysis, and validation choices that, taken together, make a result believable. This tutorial is a compact, example-driven guide to that chain for beginning researchers. We organize it as an evaluation workflow: claim, hypothesis, unit of analysis, baseline, regime sweep, uncertainty estimate, validation check, and reporting. Within that workflow we cover the classical statistical foundations (descriptive statistics, the central limit theorem, normal- and $t$-based confidence intervals, Student's $t$-test, ANOVA, the chi-squared test, Pearson correlation, linear regression) alongside the modern, distribution-free techniques (the bootstrap, Wilcoxon and Mann--Whitney tests, Cliff's delta) that are usually preferred for ECE/CS data. We also discuss factorial design, randomization and blocking, multiple-comparison correction, latency-specific pitfalls, simulation verification and validation, equivalence-style claims, and reproducibility. A running example, a comparison of two job-scheduling algorithms on simulated workloads with truncated heavy-tailed job sizes, threads through the tutorial, with Python snippets the reader can paste and adapt. The paper closes with a pre-submission checklist; companion student-facing material (project-type translation tables, an evaluation-plan worksheet, exercises, and a worked ``bad evaluation autopsy'') is collected in a separate workbook released alongside this paper.
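To give a flavor of the running example, the following is a minimal sketch and not the paper's own code. Everything in it is an assumption made for illustration: the two schedulers (FIFO and shortest-job-first on a single server with all jobs present at time zero), the bounded Pareto parameters, the 30 seeds, and every function name. Each seed is one unit of analysis, both schedulers run on the same workload, and the sketch reports a percentile-bootstrap confidence interval for the mean paired difference together with Cliff's delta as a descriptive effect size.

```python
import random
import statistics


def truncated_pareto(rng, alpha=1.5, lo=1.0, hi=1000.0):
    """Draw one job size from a Pareto(alpha) truncated to [lo, hi] (inverse CDF)."""
    u = rng.random()
    return lo / (1.0 - u * (1.0 - (lo / hi) ** alpha)) ** (1.0 / alpha)


def mean_completion_time(sizes):
    """Mean completion time when jobs run back to back in the given order."""
    clock, total = 0.0, 0.0
    for s in sizes:
        clock += s       # this job finishes at the running clock
        total += clock
    return total / len(sizes)


def run_trial(seed, n_jobs=200):
    """One unit of analysis: a single workload, evaluated under both schedulers."""
    rng = random.Random(seed)
    jobs = [truncated_pareto(rng) for _ in range(n_jobs)]
    fifo = mean_completion_time(jobs)          # arrival order (hypothetical baseline)
    sjf = mean_completion_time(sorted(jobs))   # shortest job first
    return fifo, sjf


def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs, in [-1, 1]."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))


trials = [run_trial(seed) for seed in range(30)]
fifo = [t[0] for t in trials]
sjf = [t[1] for t in trials]
diffs = [f - s for f, s in zip(fifo, sjf)]   # paired: same workload per seed

ci_low, ci_high = bootstrap_ci(diffs)
delta = cliffs_delta(fifo, sjf)
print(f"mean paired difference (FIFO - SJF): {statistics.fmean(diffs):.1f}")
print(f"95% bootstrap CI: [{ci_low:.1f}, {ci_high:.1f}]")
print(f"Cliff's delta: {delta:.2f}")
```

Because shortest-job-first minimizes mean completion time in this batch setting, every paired difference is non-negative by construction, so the sketch shows only the mechanics of pairing, resampling, and effect-size reporting. A realistic comparison would add arrivals over time, a regime sweep over load, and a rank-based test such as the Wilcoxon signed-rank test on the paired differences.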