🤖 AI Summary
Traditional boxplots detect outliers poorly under skewed or heavy-tailed distributions: they suffer from masking (failing to detect true outliers) and swamping (flagging inliers as outliers). To address this, we systematically evaluate and extend existing skew-robust boxplot variants and propose a robust variant that adjusts whisker boundaries using both quartile-based skewness and median deviation. We implement these methods in the open-source R package *ggskewboxplots*, which integrates with *ggplot2* and enables distribution-aware visualization. In Monte Carlo simulations with controlled skewness and kurtosis, implemented via the mosaic approach based on the Skewed Exponential Power distribution, skewness-adjusted variants achieve substantially better sensitivity and specificity than Tukey's classical method, reducing both false positives and false negatives. The package provides a unified, open-source, and user-friendly R toolkit that integrates multiple skew-adaptive boxplot methods for robust outlier detection and visualization in non-normal data.
📝 Abstract
Traditional boxplots are widely used for summarizing and visualizing the distribution of numerical data, yet they have significant limitations when applied to skewed or heavy-tailed distributions, often misclassifying observations through swamping (flagging typical observations as outliers) or masking (failing to detect true outliers). This paper addresses these limitations by systematically evaluating several alternative boxplots specifically designed to accommodate distributional asymmetry. We introduce ggskewboxplots, an R package that integrates multiple robust and skewness-aware boxplot variants, providing a unified and user-friendly framework for exploratory data analysis. Using extensive Monte Carlo simulations under controlled skewness and kurtosis conditions, implemented via the mosaic approach based on the Skewed Exponential Power distribution, we assess the sensitivity and specificity of each method. Simulation results indicate that classical Tukey-style boxplots are highly prone to swamping and masking, whereas robust skewness-adjusted variants, particularly those using quartile-based skewness measures or medcouple-based adjustments, achieve substantially better performance. These findings offer practical guidance for selecting reliable boxplot methods in applied settings and demonstrate how the ggskewboxplots package facilitates accessible, distribution-aware visualizations within the familiar ggplot2 workflow.
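To make the swamping problem concrete, the following minimal Python sketch compares the classical Tukey fences with one simple quartile-based asymmetric rule on a small right-skewed sample. It is an illustration only and does not reproduce the ggskewboxplots API or the specific variants evaluated in the paper: the function names, the toy sample, and the choice of a semi-interquartile-range rule (in the spirit of Kimber's SIQR boxplot, with an assumed multiplier of 3) are all assumptions made for this example.

```python
from statistics import quantiles


def quartiles(values):
    """Return (Q1, Q2, Q3) using linear interpolation between order statistics."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def tukey_fences(values, k=1.5):
    """Classical symmetric fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def bowley_skewness(values):
    """Quartile-based (Bowley) skewness in [-1, 1]; undefined when Q1 == Q3."""
    q1, q2, q3 = quartiles(values)
    return (q3 + q1 - 2 * q2) / (q3 - q1)


def siqr_fences(values, k=3.0):
    """Asymmetric fences built from the lower and upper half of the box.

    Hypothetical illustration: each whisker is scaled by its own
    semi-interquartile range, so the longer tail gets the longer fence.
    For symmetric data (Q2 - Q1 == Q3 - Q2) this matches Tukey with k = 1.5.
    """
    q1, q2, q3 = quartiles(values)
    return q1 - k * (q2 - q1), q3 + k * (q3 - q2)


def flagged(values, fences):
    """Return the observations lying strictly outside the fences."""
    lower, upper = fences
    return [v for v in values if v < lower or v > upper]


# Toy right-skewed sample (made up for this example).
sample = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 13, 20]

print("Bowley skewness:", bowley_skewness(sample))
print("Tukey fences:", tukey_fences(sample), "->", flagged(sample, tukey_fences(sample)))
print("SIQR fences:", siqr_fences(sample), "->", flagged(sample, siqr_fences(sample)))
```

On this sample the quartiles are 2, 3, and 6, giving a Bowley skewness of 0.5. The Tukey fences are (-4, 12) and flag both 13 and 20, while the asymmetric fences are (-1, 15) and flag only 20. Whether 13 is a genuine outlier depends on the underlying distribution; the sketch only shows how a symmetric rule can over-flag the long tail of a skewed sample.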