🤖 AI Summary
This work addresses the challenge of reconstructing densely overlapping neutrino events in the TeV energy regime, where traditional methods are hindered by scarce annotations and diverse downstream tasks. For the first time, we introduce the foundation model paradigm to high-energy neutrino detection, proposing a self-supervised pretraining framework based on a sparse Vision Transformer. Our approach integrates masked autoencoding with voxel-level relational modeling to learn transferable representations from heterogeneous detector data, followed by multi-task joint fine-tuning to enhance performance on downstream tasks. Evaluated on FASERCal simulation data, our model needs only ∼10³ annotated events to reach flavor-identification accuracy comparable to that of a randomly initialized model trained on ten times more labeled samples, and it matches or surpasses existing methods on public benchmarks.
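
To make the pre-training stage concrete, below is a minimal PyTorch-style sketch of the idea described above: a random fraction of voxel tokens is masked and reconstructed (masked autoencoding), while auxiliary voxel-level heads provide relational objectives such as ghost tagging and particle identification. All class names, feature dimensions, and loss weights are illustrative assumptions rather than the authors' code, a dense `nn.TransformerEncoder` stands in for the sparse ViT backbone, and the hierarchy objective is omitted for brevity.

```python
import torch
import torch.nn as nn

class MaskedVoxelPretrainer(nn.Module):
    """Masked-autoencoder pre-training with auxiliary voxel-level heads (illustrative sketch)."""

    def __init__(self, feat_dim=8, d_model=128, n_heads=4, n_layers=4,
                 n_particle_classes=5, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(feat_dim, d_model)               # per-voxel features -> tokens
        self.mask_token = nn.Parameter(torch.zeros(d_model))    # learned placeholder for masked voxels
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # dense stand-in for the sparse ViT backbone
        self.recon_head = nn.Linear(d_model, feat_dim)          # masked-autoencoder reconstruction
        self.ghost_head = nn.Linear(d_model, 1)                 # voxel-level ghost tag (binary)
        self.pid_head = nn.Linear(d_model, n_particle_classes)  # voxel-level particle identification

    def forward(self, voxels):
        # voxels: (batch, n_voxels, feat_dim) dense-padded voxel features
        tokens = self.embed(voxels)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens[mask] = self.mask_token                           # hide a random subset of voxels
        latent = self.encoder(tokens)
        return {
            "recon": self.recon_head(latent),
            "mask": mask,
            "ghost": self.ghost_head(latent).squeeze(-1),
            "pid": self.pid_head(latent),
        }

def pretrain_loss(out, voxels, ghost_labels, pid_labels):
    # Reconstruct only the masked voxels, plus the relational voxel-level objectives.
    # ghost_labels: float {0, 1} per voxel; pid_labels: integer class index per voxel.
    recon = nn.functional.mse_loss(out["recon"][out["mask"]], voxels[out["mask"]])
    ghost = nn.functional.binary_cross_entropy_with_logits(out["ghost"], ghost_labels)
    pid = nn.functional.cross_entropy(out["pid"].flatten(0, 1), pid_labels.flatten())
    return recon + ghost + pid
```

In the actual framework, a sparse attention backbone and the hierarchy objective would replace the dense stand-ins used in this sketch.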
📝 Abstract
Accelerator-based neutrino physics is entering an energy-frontier regime in which interactions reach the TeV scale and produce exceptionally dense, overlapping detector signatures. In this regime, event interpretation becomes impractical for conventional reconstruction approaches, particularly when labelled data are scarce and the analysis spans diverse downstream objectives. We present a sparse Vision Transformer (ViT) framework for learning reusable representations from heterogeneous detector data. Self-supervised pre-training combines masked autoencoder reconstruction with relational voxel-level objectives for hierarchy, ghost and particle identification, and the resulting shared encoder is then jointly fine-tuned across classification and regression tasks. Evaluated on simulated events from the proposed FASERCal concept at the LHC, we find that pre-training consistently improves neutrino flavour and charm-quark identification, momentum regression, and vertex reconstruction over training from scratch, with the addition of relational objectives yielding further gains in the most topologically complex channels. Interpretability analyses further show that pre-training yields a more structured latent space, while detector-subsystem ablations recover physically plausible channel-dependent roles for the heterogeneous inputs. A data-efficiency study shows that, with roughly $10^3$ labelled events, the pre-trained encoder already matches the flavour-classification performance of a randomly initialised model trained on an order of magnitude more data. The learned representations also transfer effectively to publicly available benchmarks spanning different detector technologies and energy scales, matching or exceeding published baselines. These results support self-supervised pre-training on multimodal detector data as a scalable route towards reusable representations for neutrino and particle-detector analysis.
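
The joint fine-tuning stage can be sketched in the same spirit: the pre-trained embedding and shared encoder are reused, and event-level heads for flavour and charm-quark classification, momentum regression, and vertex reconstruction are trained under a single summed loss. Head sizes, mean pooling, and equal task weights are assumptions made for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

class MultiTaskFineTuner(nn.Module):
    """Joint fine-tuning heads on top of a pre-trained shared encoder (illustrative sketch)."""

    def __init__(self, embed, encoder, d_model=128, n_flavours=3):
        super().__init__()
        self.embed = embed                                   # pre-trained voxel embedding, reused
        self.encoder = encoder                               # pre-trained shared backbone, reused
        self.flavour_head = nn.Linear(d_model, n_flavours)   # nu_e / nu_mu / nu_tau classification
        self.charm_head = nn.Linear(d_model, 1)              # charm-quark production (binary)
        self.momentum_head = nn.Linear(d_model, 1)           # momentum regression
        self.vertex_head = nn.Linear(d_model, 3)             # interaction vertex (x, y, z)

    def forward(self, voxels):
        # voxels: (batch, n_voxels, feat_dim); mean-pool voxel tokens into one event vector
        pooled = self.encoder(self.embed(voxels)).mean(dim=1)
        return {
            "flavour": self.flavour_head(pooled),
            "charm": self.charm_head(pooled).squeeze(-1),
            "momentum": self.momentum_head(pooled).squeeze(-1),
            "vertex": self.vertex_head(pooled),
        }

def joint_loss(out, y):
    # Equal task weights for simplicity; in practice per-task weights would be tuned.
    return (nn.functional.cross_entropy(out["flavour"], y["flavour"])
            + nn.functional.binary_cross_entropy_with_logits(out["charm"], y["charm"])
            + nn.functional.mse_loss(out["momentum"], y["momentum"])
            + nn.functional.mse_loss(out["vertex"], y["vertex"]))

# Illustrative usage together with the pre-training sketch above:
#   pretrainer = MaskedVoxelPretrainer()
#   ...self-supervised pre-training on unlabelled events...
#   model = MultiTaskFineTuner(pretrainer.embed, pretrainer.encoder)
```

Sharing one encoder across heads is what lets the data-efficiency result quoted above follow naturally: the labelled events only need to fit the lightweight task heads, not the representation itself.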