🤖 AI Summary
This work addresses the lack of synthetic data stream generators capable of simulating real-world temporal dynamics with controllable distribution shifts and evolving causal relationships. The authors propose a time-varying synthetic data generation framework grounded in structural causal models (SCMs), which explicitly models the evolution of causal mechanisms between features and targets by dynamically adjusting SCM mapping functions and incorporating causal interventions. This approach generates non-stationary data streams that jointly exhibit covariate shift and concept drift. Notably, it is the first framework to integrate time-varying causal mechanisms with controllable drift, enabling simultaneous simulation of abrupt perturbations and gradual evolutions. Experiments demonstrate that the generated data streams exhibit realistic drift characteristics and effectively reveal performance degradation and recovery patterns of machine learning models under distributional shifts, thereby providing a reliable benchmark for evaluating model robustness.
📝 Abstract
This work presents the Causal Drift Generator (CaDrift), a time-dependent synthetic data generation framework based on Structural Causal Models (SCMs). The framework can produce a virtually unlimited variety of data streams with controlled shift events and time-dependent data, making it a useful tool for evaluating methods under evolving data. CaDrift synthesizes various distributional and covariate shifts by drifting the mapping functions of the SCM, which alters the underlying cause-and-effect relationships between the features and the target. In addition, CaDrift models occasional perturbations by leveraging interventions from causal modeling. Experimental results show that, after distributional shift events, classifier accuracy tends to drop and then gradually recover, confirming the generator's effectiveness in simulating shifts. The framework has been made available on GitHub.
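To make the mechanism concrete, the following is a minimal sketch of a time-varying SCM stream. It is not CaDrift's actual API or implementation; the function `generate_stream`, its parameters, the two-feature graph, and the specific drift schedule are all assumptions chosen for illustration. It shows a gradual drift in one mapping function (covariate shift), an abrupt change in the feature-to-target mechanism (concept drift), and a short intervention window that fixes a feature to a constant.

```python
import numpy as np

def generate_stream(n_steps=2000, drift_at=1000, do_window=(1500, 1550),
                    do_value=3.0, seed=0):
    """Hypothetical toy SCM x1 -> x2 -> y (and x1 -> y) with time-varying mechanisms."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)

    # Root cause: stationary standard normal.
    x1 = rng.normal(size=n_steps)

    # Mapping x1 -> x2 drifts gradually: the weight grows from 1.0 towards 2.0,
    # so the marginal distribution of x2 changes over time (covariate shift).
    w = 1.0 + t / n_steps
    x2 = w * x1 + 0.1 * rng.normal(size=n_steps)

    # Intervention do(x2 = do_value): inside the window, x2 ignores its parent.
    in_window = (t >= do_window[0]) & (t < do_window[1])
    x2 = np.where(in_window, do_value, x2)

    # Mapping (x1, x2) -> y flips sign abruptly at drift_at (concept drift).
    sign = np.where(t < drift_at, 1.0, -1.0)
    score = sign * (x1 + x2) + 0.1 * rng.normal(size=n_steps)
    y = (score > 0).astype(int)

    return np.column_stack([x1, x2]), y

if __name__ == "__main__":
    X, y = generate_stream()
    # A fixed rule that matches the pre-drift mechanism and is never updated.
    pred = (X.sum(axis=1) > 0).astype(int)
    print("accuracy before drift:", (pred[:1000] == y[:1000]).mean())
    print("accuracy after drift: ", (pred[1000:] == y[1000:]).mean())
```

In this toy setup, a model that never adapts does well before the abrupt drift and poorly afterwards. This loosely mirrors the accuracy drop described in the abstract, although the recovery phase reported there would require a learner that updates on the stream.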