🤖 AI Summary
Neutrino event classification in high-energy physics, particularly for pixelated detector imagery, remains challenging due to sparse, low-signal data and the limited interpretability of conventional deep learning models.
Method: This work introduces vision-language models (VLMs) to this domain for the first time. We propose a multimodal architecture built on a fine-tuned LLaMA 3.2 backbone and a vision encoder: detector images are encoded into visual tokens and processed jointly with textual prompts inside the VLM, enabling semantically guided, reasoning-based classification.
Contribution/Results: (1) We pioneer the application of VLMs to particle physics image analysis; (2) our approach achieves superior accuracy and robustness over CNN baselines in distinguishing electron- and muon-type neutrino events, while enhancing model interpretability and cross-event-type generalization; (3) we empirically validate that multimodal fusion significantly improves detection of sparse high-energy physics signals, establishing a new physics-informed AI paradigm for future experimental analysis.
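The multimodal input described above, a detector image paired with a textual prompt, can be illustrated with a minimal sketch. This is not the paper's actual pipeline: the hit format, image size, and prompt wording are all assumptions for illustration, and the `<image>` placeholder convention mimics common VLM processor APIs rather than any specific LLaMA 3.2 interface.

```python
import numpy as np

def event_to_image(hits, shape=(64, 64)):
    """Rasterize a sparse list of (x, y, charge) detector hits into a
    dense 2D "detector image" of the kind a vision encoder consumes."""
    img = np.zeros(shape, dtype=np.float32)
    for x, y, q in hits:
        img[y, x] += q
    # Normalize to [0, 1] so the encoder sees a consistent dynamic range.
    if img.max() > 0:
        img /= img.max()
    return img

def build_prompt():
    """Compose the textual side of the multimodal input. The <image>
    placeholder marks where the visual tokens are spliced in."""
    return (
        "<image>\n"
        "You are analyzing a pixelated neutrino detector event. "
        "Classify it as an electron-type or muon-type neutrino "
        "interaction and briefly justify your answer."
    )

# Hypothetical sparse event: a short, electron-shower-like cluster of hits.
hits = [(10, 10, 3.0), (11, 10, 2.5), (12, 11, 1.0), (12, 9, 0.8)]
image = event_to_image(hits)
prompt = build_prompt()
# `image` and `prompt` would then be handed to the VLM's preprocessor,
# e.g. processor(images=image, text=prompt) in a Transformers-style API.
```

The key design point is that the sparse pixel data and the task description travel together into one model, which is what lets the VLM condition its classification on auxiliary semantic information.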
📝 Abstract
Recent advances in Large Language Models (LLMs) have demonstrated a remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the application of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMA 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron- and muon-neutrino events. Our evaluation considers both classification performance and the interpretability of model predictions. We find that VLMs can outperform CNNs while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification: their performance, interpretability, and generalizability open new avenues for integrating multimodal reasoning in experimental neutrino physics.