Multi-Modal Masked Autoencoders for Learning Image-Spectrum Associations for Galaxy Evolution and Cosmology

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing modality heterogeneity and severe label scarcity in galaxy imaging and spectroscopic data, this work introduces a large-scale paired multimodal galaxy dataset (134,533 image–spectrum pairs from HSC-PDR2 imaging and DESI-DR1 spectroscopy) and proposes a Transformer-based Multi-Modal Masked Autoencoder (MMAE). The MMAE learns cross-modal representations without labels by jointly reconstructing masked image and spectral tokens, supporting morphology recovery, emission-line reconstruction, and broad continuum-slope estimation even when one modality is missing. In redshift regression from images alone, it performs comparably to or better than prior multimodal models in prediction scatter. The work applies masked modeling to astronomical multimodal learning, pointing toward scalable cross-modal representations for galaxy evolution studies and astronomical foundation models.
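To make the pipeline concrete, here is a minimal sketch of how such a model might tokenize the two modalities into one shared sequence: image patches through a ViT-style strided convolution, spectrum segments through a linear projection, with learned position and modality embeddings. All names and sizes (patch width, segment length, token dimension, band count) and the PyTorch framing are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: tokenizing a galaxy image and its spectrum
# into one shared token sequence, in the spirit of the paper's MMAE.
# Sizes (8x8 patches, 64-pixel spectral segments, 256-dim tokens, 5 bands)
# are assumptions for demonstration, not the authors' settings.
import torch
import torch.nn as nn

class MultiModalTokenizer(nn.Module):
    def __init__(self, img_size=64, patch=8, bands=5,
                 spec_len=4096, seg=64, dim=256):
        super().__init__()
        # ViT-style patch embedding: a strided conv turns patches into tokens.
        self.img_proj = nn.Conv2d(bands, dim, kernel_size=patch, stride=patch)
        # Non-overlapping 1D spectral segments projected to the same width.
        self.spec_proj = nn.Linear(seg, dim)
        self.seg = seg
        n_img, n_spec = (img_size // patch) ** 2, spec_len // seg
        # Learned position embeddings for the joint sequence, plus a
        # per-modality embedding so the encoder can tell tokens apart.
        self.pos = nn.Parameter(torch.zeros(1, n_img + n_spec, dim))
        self.modality = nn.Parameter(torch.zeros(1, 2, dim))

    def forward(self, image, spectrum):
        # image: (B, bands, H, W); spectrum: (B, spec_len)
        img_tok = self.img_proj(image).flatten(2).transpose(1, 2)
        spec_tok = self.spec_proj(spectrum.unfold(1, self.seg, self.seg))
        img_tok = img_tok + self.modality[:, 0:1]
        spec_tok = spec_tok + self.modality[:, 1:2]
        return torch.cat([img_tok, spec_tok], dim=1) + self.pos

# Usage: 2 galaxies -> (2, 64 + 64, 256) joint image+spectrum sequence.
tokens = MultiModalTokenizer()(torch.randn(2, 5, 64, 64), torch.randn(2, 4096))
```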

📝 Abstract
Upcoming surveys will produce billions of galaxy images but comparatively few spectra, motivating models that learn cross-modal representations. We build a dataset of 134,533 galaxy images (HSC-PDR2) and spectra (DESI-DR1) and adapt a Multi-Modal Masked Autoencoder (MMAE) to embed both images and spectra in a shared representation. The MMAE is a transformer-based architecture, which we train by masking 75% of the data and reconstructing missing image and spectral tokens. We use this model to test three applications: spectral reconstruction and image reconstruction from heavily masked data, and redshift regression from images alone. It recovers key physical features, such as galaxy shapes, atomic emission line peaks, and broad continuum slopes, though it struggles with fine image details and line strengths. For redshift regression, the MMAE performs comparably or better than prior multi-modal models in terms of prediction scatter even when missing spectra in testing. These results highlight both the potential and limitations of masked autoencoders in astrophysics and motivate extensions to additional modalities, such as text, for foundation models.
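The 75% masking stated in the abstract follows the standard masked-autoencoder recipe: drop a random subset of tokens, encode the visible remainder, and score reconstructions only at the masked positions. Below is a minimal sketch of that masking and loss, assuming MAE-style random shuffling and per-token MSE; the paper's exact masking scheme and loss details may differ.

```python
# MAE-style masking and loss sketch. The 75% mask ratio comes from the
# abstract; the shuffle-and-gather mechanics and per-token MSE follow the
# original masked-autoencoder recipe and are assumptions about this paper.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random 25% of tokens; return visible tokens and a binary mask."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)  # 1 = masked (to reconstruct), 0 = visible
    return visible, mask

def masked_recon_loss(pred, target, mask):
    """MSE over masked positions only; pred must be in original token order."""
    per_token = ((pred - target) ** 2).mean(dim=-1)  # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

Because the mask is drawn over the joint sequence, both image and spectral tokens can be hidden at once, which is what forces the encoder to learn cross-modal associations rather than per-modality shortcuts.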
Problem

Research questions and friction points this paper addresses.

Learning cross-modal representations for galaxy images and spectra
Reconstructing missing spectral and image data from masked inputs
Predicting galaxy redshifts using only image data without spectra
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal masked autoencoder for cross-modal representation
Transformer architecture reconstructs masked image and spectral tokens
Shared embedding enables redshift regression from images alone (sketched below)
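As referenced above, image-only redshift regression amounts to feeding just the image tokens to the shared encoder and pooling to a scalar; spectrum tokens are simply absent, which attention tolerates because it is length-agnostic. A self-contained sketch follows; the encoder depth, pooling choice, and regression head are assumptions rather than the paper's actual configuration.

```python
# Hedged sketch of image-only redshift regression on a shared encoder.
# Depth, pooling, and head sizes are illustrative assumptions.
import torch
import torch.nn as nn

class RedshiftRegressor(nn.Module):
    """Image-only redshift head on top of a (pretrained) shared encoder."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1)
        )

    def forward(self, img_tokens):
        # img_tokens: (B, n_img, dim), image tokens only.
        z = self.encoder(img_tokens)                  # contextualized tokens
        return self.head(z.mean(dim=1)).squeeze(-1)   # mean-pool -> scalar z

# Usage: 64 image tokens of width 256 per galaxy (illustrative shapes).
model = RedshiftRegressor()
pred_z = model(torch.randn(2, 64, 256))  # -> tensor of shape (2,)
```

In practice the encoder would first be pretrained with the masked-reconstruction objective and then frozen or fine-tuned for the regression task.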
👥 Authors

Morgan Himes
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095

Samiksha Krishnamurthy
Department of Electrical and Computer Engineering, UCLA, Los Angeles, CA 90095

Andrew Lizarraga
PhD Student @ UCLA
Representation Learning · Generative Modeling · Statistics

Srinath Saikrishnan
Department of Computer Science, UCLA, Los Angeles, CA 90095

Vikram Seenivasan
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095

Jonathan Soriano
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095

Ying Nian Wu
UCLA Department of Statistics and Data Science
Generative AI · Representation Learning · Computer Vision · Computational Neuroscience · Bioinformatics

Tuan Do
Department of Physics and Astronomy, UCLA, Los Angeles, CA 90095