🤖 AI Summary
In empirical social science and medical research, testing many hypotheses at once inflates the risk of false positives. Classical corrections such as Bonferroni control the family-wise error rate (FWER) but are conservative and reduce statistical power. Stepwise procedures (e.g., Holm, Hochberg) recover some of that power, yet they ignore the logical or causal hierarchies that often link hypotheses. This paper reviews hierarchical multiple testing procedures, including fixed-sequence, fallback, and gatekeeping methods, and compares their statistical properties and practical trade-offs with those of the classical approaches. By exploiting the pre-specified structure among hypotheses, these procedures can maintain FWER control while offering greater power and more interpretable inference, which makes them well suited to hierarchically structured studies such as clinical trials and policy evaluations.
📝 Abstract
Empirical research in the social and medical sciences frequently involves testing multiple hypotheses simultaneously, increasing the risk of false positives due to chance. Classical multiple testing procedures, such as the Bonferroni correction, control the family-wise error rate (FWER) but tend to be overly conservative, reducing statistical power. Stepwise alternatives like the Holm and Hochberg procedures offer improved power while maintaining error control under certain dependence structures. However, these standard approaches typically ignore hierarchical relationships among hypotheses, structures that are common in settings such as clinical trials and program evaluations, where outcomes are often logically or causally linked. Hierarchical multiple testing procedures, including fixed-sequence, fallback, and gatekeeping methods, explicitly incorporate these relationships, providing more powerful and interpretable frameworks for inference. This paper reviews key hierarchical methods, compares their statistical properties and practical trade-offs, and discusses implications for applied empirical research.
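To make the contrast between the classical and hierarchical procedures concrete, the following is a minimal Python sketch of four of the methods named above, written from their standard textbook definitions rather than taken from the paper. The function names, the example p-values, and the equal fallback weights are illustrative assumptions. For the fixed-sequence and fallback procedures, the input order is assumed to be the testing order fixed before seeing the data.

```python
from typing import List, Sequence


def bonferroni(pvals: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject H_i when p_i <= alpha / m (single-step, ignores ordering)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Step-down Holm: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def fixed_sequence(pvals: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Test hypotheses in the given order, each at full alpha, until one fails."""
    reject = [False] * len(pvals)
    for i, p in enumerate(pvals):
        if p <= alpha:
            reject[i] = True
        else:
            break  # later hypotheses are not tested
    return reject


def fallback(pvals: Sequence[float], weights: Sequence[float],
             alpha: float = 0.05) -> List[bool]:
    """Fallback: each hypothesis gets alpha * w_i, plus the level carried
    forward from the previous hypothesis if that one was rejected.
    Weights are assumed to be non-negative and to sum to 1."""
    reject = []
    carried = 0.0
    for p, w in zip(pvals, weights):
        level = alpha * w + carried
        rejected = p <= level
        reject.append(rejected)
        carried = level if rejected else 0.0  # testing continues either way
    return reject


# Hypothetical p-values, listed in their pre-specified order of importance.
pvals = [0.012, 0.030, 0.040, 0.200]
print(bonferroni(pvals))      # [True, False, False, False]
print(holm(pvals))            # [True, False, False, False]
print(fixed_sequence(pvals))  # [True, True, True, False]
```

With these illustrative inputs the fixed-sequence procedure rejects three hypotheses where Bonferroni and Holm reject one, which is the power gain available when a credible ordering exists. The cost is that a single early failure blocks everything after it: `fixed_sequence([0.20, 0.01])` rejects nothing, whereas `fallback([0.20, 0.01], [0.5, 0.5])` still rejects the second hypothesis at its reserved level of 0.025.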