Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing modality heterogeneity and severe label scarcity in galaxy imaging and spectroscopic data, this work introduces a large-scale paired multimodal galaxy dataset (134,533 image–spectrum pairs from HSC-PDR2 imaging and DESI-DR1 spectroscopy) and proposes a Transformer-based Multi-Modal Masked Autoencoder (MMAE). The MMAE learns cross-modal representations without labels by jointly reconstructing masked image and spectral tokens, supporting morphology recovery, emission-line reconstruction, and broad continuum-slope estimation even when one modality is missing. In redshift regression from images alone, it performs comparably to or better than prior multimodal models in prediction scatter. The work applies masked modeling to astronomical multimodal learning, pointing toward scalable cross-modal representations for galaxy evolution studies and astronomical foundation models.
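To make the pipeline concrete, here is a minimal sketch of how such a model might tokenize the two modalities into one shared sequence: image patches through a ViT-style strided convolution, spectrum segments through a linear projection, with learned position and modality embeddings. All names and sizes (patch width, segment length, token dimension, band count) and the PyTorch framing are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: tokenizing a galaxy image and its spectrum
# into one shared token sequence, in the spirit of the paper's MMAE.
# Sizes (8x8 patches, 64-pixel spectral segments, 256-dim tokens, 5 bands)
# are assumptions for demonstration, not the authors' settings.
import torch
import torch.nn as nn

class MultiModalTokenizer(nn.Module):
    def __init__(self, img_size=64, patch=8, bands=5,
                 spec_len=4096, seg=64, dim=256):
        super().__init__()
        # ViT-style patch embedding: a strided conv turns patches into tokens.
        self.img_proj = nn.Conv2d(bands, dim, kernel_size=patch, stride=patch)
        # Non-overlapping 1D spectral segments projected to the same width.
        self.spec_proj = nn.Linear(seg, dim)
        self.seg = seg
        n_img, n_spec = (img_size // patch) ** 2, spec_len // seg
        # Learned position embeddings for the joint sequence, plus a
        # per-modality embedding so the encoder can tell tokens apart.
        self.pos = nn.Parameter(torch.zeros(1, n_img + n_spec, dim))
        self.modality = nn.Parameter(torch.zeros(1, 2, dim))

    def forward(self, image, spectrum):
        # image: (B, bands, H, W); spectrum: (B, spec_len)
        img_tok = self.img_proj(image).flatten(2).transpose(1, 2)
        spec_tok = self.spec_proj(spectrum.unfold(1, self.seg, self.seg))
        img_tok = img_tok + self.modality[:, 0:1]
        spec_tok = spec_tok + self.modality[:, 1:2]
        return torch.cat([img_tok, spec_tok], dim=1) + self.pos

# Usage: 2 galaxies -> (2, 64 + 64, 256) joint image+spectrum sequence.
tokens = MultiModalTokenizer()(torch.randn(2, 5, 64, 64), torch.randn(2, 4096))
```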

📝 Abstract
Upcoming surveys will produce billions of galaxy images but comparatively few spectra, motivating models that learn cross-modal representations. We build a dataset of 134,533 galaxy images (HSC-PDR2) and spectra (DESI-DR1) and adapt a Multi-Modal Masked Autoencoder (MMAE) to embed both images and spectra in a shared representation. The MMAE is a transformer-based architecture, which we train by masking 75% of the data and reconstructing missing image and spectral tokens. We use this model to test three applications: spectral reconstruction and image reconstruction from heavily masked data, and redshift regression from images alone. It recovers key physical features, such as galaxy shapes, atomic emission line peaks, and broad continuum slopes, though it struggles with fine image details and line strengths. For redshift regression, the MMAE performs comparably or better than prior multi-modal models in terms of prediction scatter even when missing spectra in testing. These results highlight both the potential and limitations of masked autoencoders in astrophysics and motivate extensions to additional modalities, such as text, for foundation models.
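The 75% masking stated in the abstract follows the standard masked-autoencoder recipe: drop a random subset of tokens, encode the visible remainder, and score reconstructions only at the masked positions. Below is a minimal sketch of that masking and loss, assuming MAE-style random shuffling and per-token MSE; the paper's exact masking scheme and loss details may differ.

```python
# MAE-style masking and loss sketch. The 75% mask ratio comes from the
# abstract; the shuffle-and-gather mechanics and per-token MSE follow the
# original masked-autoencoder recipe and are assumptions about this paper.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random 25% of tokens; return visible tokens and a binary mask."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)  # 1 = masked (to reconstruct), 0 = visible
    return visible, mask

def masked_recon_loss(pred, target, mask):
    """MSE over masked positions only; pred must be in original token order."""
    per_token = ((pred - target) ** 2).mean(dim=-1)  # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

Because the mask is drawn over the joint sequence, both image and spectral tokens can be hidden at once, which is what forces the encoder to learn cross-modal associations rather than per-modality shortcuts.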
Problem

Research questions and friction points this paper addresses.

Learning cross-modal representations for galaxy images and spectra
Reconstructing missing spectral and image data from masked inputs
Predicting galaxy redshifts using only image data without spectra
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal masked autoencoder for cross-modal representation
Transformer architecture reconstructs masked image and spectral tokens
Shared embedding enables redshift regression from images alone (sketched below)
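As referenced above, image-only redshift regression amounts to feeding just the image tokens to the shared encoder and pooling to a scalar; spectrum tokens are simply absent, which attention tolerates because it is length-agnostic. A self-contained sketch follows; the encoder depth, pooling choice, and regression head are assumptions rather than the paper's actual configuration.

```python
# Hedged sketch of image-only redshift regression on a shared encoder.
# Depth, pooling, and head sizes are illustrative assumptions.
import torch
import torch.nn as nn

class RedshiftRegressor(nn.Module):
    """Image-only redshift head on top of a (pretrained) shared encoder."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1)
        )

    def forward(self, img_tokens):
        # img_tokens: (B, n_img, dim), image tokens only.
        z = self.encoder(img_tokens)                  # contextualized tokens
        return self.head(z.mean(dim=1)).squeeze(-1)   # mean-pool -> scalar z

# Usage: 64 image tokens of width 256 per galaxy (illustrative shapes).
model = RedshiftRegressor()
pred_z = model(torch.randn(2, 64, 256))  # -> tensor of shape (2,)
```

In practice the encoder would first be pretrained with the masked-reconstruction objective and then frozen or fine-tuned for the regression task.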
👥 Authors

Morgan Himes
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095

Samiksha Krishnamurthy
Department of Electrical and Computer Engineering, UCLA, Los Angeles, CA 90095

Andrew Lizarraga
PhD Student @ UCLA
Representation Learning · Generative Modeling · Statistics

Srinath Saikrishnan
Department of Computer Science, UCLA, Los Angeles, CA 90095

Vikram Seenivasan
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095

Jonathan Soriano
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095

Ying Nian Wu
UCLA Department of Statistics and Data Science
Generative AI · Representation Learning · Computer Vision · Computational Neuroscience · Bioinformatics

Tuan Do
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095