🤖 AI Summary
Target variable transformation, a critical yet long-overlooked preprocessing step in machine learning regression, lacks principled guidance, which leads to suboptimal model performance. Method: We systematically investigate its impact mechanisms via empirical case studies, statistical diagnostics (e.g., residual distribution tests, heteroscedasticity detection), and domain-informed heuristic reasoning. Contribution/Results: We propose the first actionable decision framework for “when and how to transform”, grounded in empirically derived applicability criteria (e.g., skewness, scale imbalance, nonlinear effects, temporal trend contamination). We further formulate generalizable heuristics mapping common data issues (e.g., population-size bias, inflation drift, score compression) to optimal transformations (e.g., log, Box–Cox, quantile normalization). Extensive experiments demonstrate substantial improvements in model fit accuracy and out-of-sample stability, thereby bridging a key theoretical and practical gap in how ML pipelines handle target-variable preprocessing.
📝 Abstract
The machine learning pipeline typically involves the iterative process of (1) collecting the data, (2) preparing the data, (3) learning a model, and (4) evaluating the model. Practitioners recognize the importance of the data preparation phase in terms of its impact on the ability to learn accurate models. In this regard, significant attention is often paid to manipulating the feature set (e.g., selection, transformations, dimensionality reduction). A point that is less well appreciated is that transformations on the target variable can also have a large impact on whether it is possible to learn a suitable model. These transformations may include accounting for subject-specific biases (e.g., in how someone uses a rating scale), contexts (e.g., population size effects), and general trends (e.g., inflation). However, this point has received only cursory treatment in the existing literature. The goal of this paper is threefold. First, we aim to highlight the importance of this problem by showing when transforming the target variable has been useful in practice. Second, we will provide a set of generic "rules of thumb" that indicate situations when transforming the target variable may be needed. Third, we will discuss which transformations should be considered in a given situation.