Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Extreme events, such as stock market crashes, earthquakes, and pandemics, are rare, catastrophic, and propagate across interconnected systems, leading to severe data scarcity that undermines data-driven modeling. To address this, we present the first systematic survey of synthetic data generation methods tailored to extreme events. The survey covers generative models (GANs, diffusion models, VAEs), large language models, statistical modeling, and targeted resampling strategies. We introduce a tailored evaluation framework encompassing statistical fidelity, dependency preservation, visual plausibility, and task-oriented utility, and analyze each metric's validity under heavy-tailed distributions. We also curate benchmark datasets across finance, meteorology, geoscience, and epidemiology, identify underexplored domains, including behavioral finance, wildfire dynamics, and windstorm modeling, and distill key open challenges to advance the reliability and practicality of extreme-event modeling.

📝 Abstract

Extreme events, such as market crashes, natural disasters, and pandemics, are rare but catastrophic, often triggering cascading failures across interconnected systems. Accurate prediction and early warning can help minimize losses and improve preparedness. Data-driven methods offer powerful capabilities for extreme event modeling, but they require abundant training data, and extreme event data is inherently scarce, which creates a fundamental challenge. Synthetic data generation has emerged as a powerful solution. However, existing surveys focus on general-purpose data with an emphasis on privacy preservation rather than on the unique performance requirements of extreme events. This survey provides the first overview of synthetic data generation for extreme events. We systematically review generative modeling techniques and large language models, particularly those enhanced by statistical theory and by specialized training and sampling mechanisms designed to capture heavy-tailed distributions. We summarize benchmark datasets and introduce a tailored evaluation framework covering statistical, dependence, visual, and task-oriented metrics. A central contribution is our in-depth analysis of each metric's applicability to extreme settings and of its domain-specific adaptations, providing actionable guidance for model evaluation in extreme settings. We categorize key application domains and identify underexplored areas such as behavioral finance, wildfires, earthquakes, windstorms, and infectious outbreaks. Finally, we outline open challenges, providing a structured foundation for advancing synthetic rare-event research.
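As an illustration of what a generator "enhanced by statistical theory" might look like, the sketch below uses a peaks-over-threshold approach: it fits a generalized Pareto distribution (GPD) to the exceedances above a high threshold and samples new tail values from the fit. This is not a method taken from the survey. The function names, the method-of-moments fit, and the 0.95 threshold quantile are all assumptions chosen to keep the example short, and the moment-based fit is only sensible when the tail shape is below 0.5 (finite variance).

```python
import math
import random


def fit_gpd_moments(excesses):
    """Method-of-moments GPD fit (hypothetical helper).

    Uses mean = sigma / (1 - xi) and var = sigma^2 / ((1 - xi)^2 (1 - 2 xi)),
    so it is only meaningful when the true shape xi is below 0.5.
    Returns (xi, sigma).
    """
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((x - mean) ** 2 for x in excesses) / (n - 1)
    ratio = mean * mean / var          # equals 1 - 2*xi for an exact GPD
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (ratio + 1.0)
    return xi, sigma


def sample_gpd(xi, sigma, n, rng):
    """Draw n GPD excesses by inverting the GPD distribution function."""
    draws = []
    for _ in range(n):
        u = rng.random()               # uniform on [0, 1)
        if abs(xi) < 1e-9:             # xi -> 0 is the exponential case
            draws.append(-sigma * math.log(1.0 - u))
        else:
            draws.append(sigma / xi * ((1.0 - u) ** (-xi) - 1.0))
    return draws


def synthesize_extremes(data, quantile, n_new, rng):
    """Fit a GPD above the empirical quantile and sample n_new tail values."""
    ordered = sorted(data)
    threshold = ordered[int(quantile * len(ordered))]
    excesses = [x - threshold for x in data if x > threshold]
    xi, sigma = fit_gpd_moments(excesses)
    return threshold, [threshold + e for e in sample_gpd(xi, sigma, n_new, rng)]


# Toy "observed" data: Pareto draws stand in for a heavy-tailed loss series.
rng = random.Random(0)
observed = [rng.paretovariate(5.0) for _ in range(5000)]
threshold, synthetic = synthesize_extremes(observed, 0.95, 200, rng)
```

Every synthetic value lies at or above the fitted threshold, so the output can supplement the scarce tail region of a training set while leaving the bulk of the data untouched. In practice a maximum-likelihood fit and a diagnostic for threshold choice would likely replace the simple choices made here.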

Problem

Research questions and friction points this paper is trying to address.

Synthetic data generation for rare extreme events

Addressing scarcity of training data for extreme events

Evaluating models for heavy-tailed distributions in extreme events

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative modeling for heavy-tailed distributions

LLMs enhanced by statistical theory

Tailored evaluation framework for extreme events
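To make the idea of a tail-aware statistical metric concrete, here is a minimal sketch that compares real and synthetic samples by their Hill tail-index estimates. The Hill estimator is a standard tool from extreme value theory, but its use here, the function name `hill_tail_index`, the choice of `k`, and the toy data are assumptions for illustration and are not drawn from the survey's evaluation suite.

```python
import math
import random


def hill_tail_index(sample, k):
    """Hill estimate of the tail index alpha from the k largest values.

    Averages log(X_(i) / X_(k+1)) over the k largest observations and
    returns the reciprocal. Requires positive values in the upper tail.
    """
    ordered = sorted(sample, reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    if ordered[k] <= 0:
        raise ValueError("the k+1 largest values must be positive")
    mean_log_ratio = sum(math.log(ordered[i] / ordered[k]) for i in range(k)) / k
    return 1.0 / mean_log_ratio


rng = random.Random(42)
n, k = 10_000, 500

# "Real" data with a Pareto tail (true tail index 3).
real = [rng.paretovariate(3.0) for _ in range(n)]
# A faithful generator: same tail behaviour, independent draws.
faithful = [rng.paretovariate(3.0) for _ in range(n)]
# A generator that underestimates extremes: shifted exponential, light tail.
light_tailed = [1.0 + rng.expovariate(1.0) for _ in range(n)]

alpha_real = hill_tail_index(real, k)
gap_faithful = abs(hill_tail_index(faithful, k) - alpha_real)
gap_light = abs(hill_tail_index(light_tailed, k) - alpha_real)
```

A smaller gap suggests the synthetic tail decays at a rate closer to the real one. The estimate is sensitive to `k`, so a comparison like this is usually repeated over a range of `k` values and not trusted at a single setting.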
