Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events

📅 2025-06-04
🤖 AI Summary
Extreme events such as stock market crashes, earthquakes, and pandemics are rare and catastrophic, and their effects can propagate across interconnected systems. Their rarity causes severe data scarcity, which undermines data-driven modeling. This paper presents the first survey of synthetic data generation methods tailored to extreme events. It reviews generative models (GANs, diffusion models, VAEs), large language models, statistical modeling, and targeted resampling strategies, with particular attention to techniques that capture heavy-tailed distributions. It also introduces a tailored evaluation framework covering statistical fidelity, dependency preservation, visual plausibility, and task-oriented utility, and analyzes how well each metric applies under heavy-tailed distributions. Finally, it summarizes benchmark datasets across finance, meteorology, geoscience, and epidemiology, identifies underexplored areas (including behavioral finance, wildfires, earthquakes, windstorms, and infectious outbreaks), and outlines open challenges for reliable and practical extreme-event modeling.

📝 Abstract
Extreme events, such as market crashes, natural disasters, and pandemics, are rare but catastrophic, often triggering cascading failures across interconnected systems. Accurate prediction and early warning can help minimize losses and improve preparedness. While data-driven methods offer powerful capabilities for extreme event modeling, they require abundant training data, yet extreme event data is inherently scarce, creating a fundamental challenge. Synthetic data generation has emerged as a powerful solution. However, existing surveys focus on general data with an emphasis on privacy preservation, rather than on the unique performance requirements of extreme events. This survey provides the first overview of synthetic data generation for extreme events. We systematically review generative modeling techniques and large language models, particularly those enhanced by statistical theory as well as specialized training and sampling mechanisms to capture heavy-tailed distributions. We summarize benchmark datasets and introduce a tailored evaluation framework covering statistical, dependence, visual, and task-oriented metrics. A central contribution is our in-depth analysis of each metric's applicability to extreme settings and its domain-specific adaptations, providing actionable guidance for model evaluation in extreme settings. We categorize key application domains and identify underexplored areas like behavioral finance, wildfires, earthquakes, windstorms, and infectious outbreaks. Finally, we outline open challenges, providing a structured foundation for advancing synthetic rare-event research.
Problem

Research questions and friction points this paper is trying to address.

Synthetic data generation for rare extreme events
Addressing scarcity of training data for extreme events
Evaluating models for heavy-tailed distributions in extreme events
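To make the last point concrete, here is a minimal sketch of one way tail fidelity could be checked: comparing the Hill estimate of the tail index between real and synthetic samples. This is an illustration only and is not taken from the survey; the function names, the choice of `k`, and the Pareto and Gaussian toy data are all assumptions made for the example.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index alpha from the k largest positive samples."""
    xs = sorted((x for x in samples if x > 0), reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < number of positive samples")
    threshold = xs[k]  # the (k+1)-th largest value acts as the tail threshold
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


def pareto_sample(alpha, n, rng):
    """Inverse-CDF sampling from a Pareto distribution with minimum value 1."""
    # 1 - rng.random() lies in (0, 1], so the power is always finite.
    return [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


rng = random.Random(0)

# Toy stand-ins (assumptions): "real" heavy-tailed data and two candidate generators.
real = pareto_sample(alpha=2.0, n=20000, rng=rng)
synthetic_good = pareto_sample(alpha=2.0, n=20000, rng=rng)
synthetic_thin = [abs(rng.gauss(0.0, 1.0)) + 1.0 for _ in range(20000)]

alpha_real = hill_tail_index(real, k=500)
alpha_good = hill_tail_index(synthetic_good, k=500)
alpha_thin = hill_tail_index(synthetic_thin, k=500)

print("real:", alpha_real)
print("synthetic (heavy-tailed):", alpha_good)
print("synthetic (thin-tailed):", alpha_thin)
```

A generator that reproduces the tail should give an estimate close to that of the real data, while a thin-tailed generator yields a much larger estimate, signalling that it underrepresents extremes. The estimate is sensitive to `k`, so in practice it would be inspected over a range of values.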
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative modeling for heavy-tailed distributions
LLMs enhanced by statistical theory
Tailored evaluation framework for extreme events
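As a hedged illustration of what generation guided by statistical theory can look like, the sketch below draws synthetic threshold exceedances from a generalized Pareto distribution (GPD), the standard extreme-value model for values above a high threshold. It is not the survey's method; the function name and the parameter values `u`, `sigma`, and `xi` are hypothetical and would normally be fitted to observed exceedances.

```python
import math
import random


def sample_gpd_exceedances(u, sigma, xi, n, rng):
    """Draw n values above threshold u whose excesses follow GPD(sigma, xi)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    values = []
    for _ in range(n):
        p = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0) and 0 ** negative
        if abs(xi) < 1e-12:
            excess = -sigma * math.log(p)  # exponential limit as xi -> 0
        else:
            excess = sigma / xi * (p ** (-xi) - 1.0)  # inverse of the GPD CDF
        values.append(u + excess)
    return values


rng = random.Random(42)

# Hypothetical parameters: threshold 10, scale 1, shape 0.2 (a moderately heavy tail).
synthetic_extremes = sample_gpd_exceedances(u=10.0, sigma=1.0, xi=0.2, n=50000, rng=rng)

mean_excess = sum(x - 10.0 for x in synthetic_extremes) / len(synthetic_extremes)
print("smallest synthetic value:", min(synthetic_extremes))
print("mean excess over threshold:", mean_excess)
```

For a shape parameter below 1, the theoretical mean excess is `sigma / (1 - xi)`, which gives a quick sanity check on the sampler. Larger `xi` produces heavier tails and therefore more frequent very large synthetic values.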
Jingyi Gu
New Jersey Institute of Technology, Newark, NJ, USA
Xuan Zhang
New Jersey Institute of Technology, Newark, NJ, USA
Guiling Wang
University of Connecticut
Water Cycle, Climate Change, Climate Extremes, Ecosystem, Land-Atmosphere Interactions