🤖 AI Summary
Problem: Self-supervised representation learning in the physical sciences, particularly in domains that rely on stochastic simulators (e.g., high-energy physics experiments), struggles to generate augmentations that are both diverse and physically consistent without compromising interpretability or fidelity.
Method: We propose RS3L, a framework that performs *controlled interventions at intermediate stages* of the simulation pipeline and *re-executes the downstream components*, yielding multiple physically coherent realizations of the same event. This “intermediate intervention + downstream resimulation” mechanism builds domain-specific physical priors directly into the data augmentation, so the augmentations used for contrastive learning are interpretable and span the physics-driven variations available in the simulator. The work explores RS3L pretraining as a route toward a foundation model.
Contribution/Results: In high-energy physics experiments, RS3L pretraining yields strong performance on downstream tasks such as discriminating a variety of objects and mitigating uncertainties. The RS3L dataset is publicly released for further studies of SSL strategies.
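The intervention-and-resimulation mechanism summarized above can be illustrated with a toy two-stage simulator. This is a minimal sketch under stated assumptions, not the RS3L pipeline: `upstream_stage`, `downstream_stage`, `resimulate`, the parton energies, and the smearing model are all hypothetical stand-ins for a real event generator and detector simulation.

```python
import numpy as np

def upstream_stage(rng):
    """Hypothetical 'hard process': draw the latent truth of one event."""
    n_partons = int(rng.integers(2, 5))                  # 2 to 4 toy partons
    return rng.exponential(scale=50.0, size=n_partons)   # toy parton energies

def downstream_stage(partons, rng, smear=0.1):
    """Hypothetical stochastic downstream stage (toy shower plus detector)."""
    hits = []
    for energy in partons:
        fractions = rng.dirichlet(np.ones(3))    # split each parton into 3 particles
        noise = rng.normal(1.0, smear, size=3)   # multiplicative detector smearing
        hits.append(energy * fractions * noise)
    return np.concatenate(hits)

def resimulate(event_seed, n_views=2, smear=0.1):
    """Freeze the upstream state, then rerun the downstream stage with fresh seeds.

    The returned views share the same latent event and differ only through
    downstream randomness, so they can serve as positive pairs.
    """
    partons = upstream_stage(np.random.default_rng(event_seed))
    # Independent child seeds, one per downstream rerun (an illustrative choice).
    child_seeds = np.random.SeedSequence(event_seed).spawn(n_views)
    views = [
        downstream_stage(partons, np.random.default_rng(seed), smear)
        for seed in child_seeds
    ]
    return partons, views

partons, views = resimulate(event_seed=7, n_views=3)
print(len(partons), [v.shape for v in views])
```

Each call to `resimulate` keeps the intervention point fixed (here, the parton list) and varies only what happens after it, which is what makes the resulting views augmentations of the same event rather than different events.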
📝 Abstract
Self-supervised learning (SSL) is at the core of training modern large machine learning models, providing a scheme for learning powerful representations that can be used in a variety of downstream tasks. However, SSL strategies must be adapted to the type of training data and downstream tasks required. We propose resimulation-based self-supervised representation learning (RS3L), a novel simulation-based SSL strategy that employs a method of resimulation to drive data augmentation for contrastive learning in the physical sciences, particularly in fields that rely on stochastic simulators. By intervening in the middle of the simulation process and rerunning simulation components downstream of the intervention, we generate multiple realizations of an event, thus producing a set of augmentations covering all physics-driven variations available in the simulator. Using experiments from high-energy physics, we explore how this strategy may enable the development of a foundation model; we show how RS3L pretraining enables powerful performance in downstream tasks such as discrimination of a variety of objects and uncertainty mitigation. In addition to our results, we make the RS3L dataset publicly available for further studies on how to improve SSL strategies.
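To make the contrastive-learning step concrete, the sketch below computes a SimCLR-style NT-Xent loss over paired embeddings, where row `i` of `z_a` and row `i` of `z_b` are assumed to come from two resimulated views of the same event. The abstract does not specify the loss, so the choice of NT-Xent, the function name `nt_xent_loss`, and the `temperature` value are assumptions made for illustration.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent loss for a batch of paired embeddings (illustrative choice of loss).

    z_a[i] and z_b[i] embed two views of the same event; every other row in
    the combined batch acts as a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so dot = cosine
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity

    # Row i's positive partner sits n rows away in the concatenated batch.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax evaluated at the positive index.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), pos] - log_norm
    return float(-log_prob.mean())

rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))
z_close = z_a + 0.01 * rng.normal(size=(8, 16))   # views that embed nearly identically
z_random = rng.normal(size=(8, 16))               # unrelated embeddings
print(nt_xent_loss(z_a, z_close), nt_xent_loss(z_a, z_random))
```

The loss is lower when the two views of each event map to nearby embeddings than when they are unrelated, which is the pressure that pushes an encoder to become insensitive to the downstream simulation randomness.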
Published by the American Physical Society
2025