Semantic search for 100M+ galaxy images using AI-generated captions

📅 2025-12-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Astronomical research faces a critical bottleneck due to the scarcity of labeled data and low efficiency of manual visual inspection. Method: We propose AION-Search—the first zero-shot semantic search engine for 140 million galaxy images—leveraging astronomy-aware image descriptions generated by vision-language models (e.g., BLIP-2), contrastive alignment training to construct a scalable, cross-modal embedding space, and a VLM-driven semantic re-ranking mechanism. Contribution/Results: AION-Search achieves, for the first time, high-recall zero-shot semantic retrieval of rare astronomical phenomena (e.g., tidal tails, ring galaxies). It nearly doubles top-100 recall on the most challenging targets and enables millisecond-scale real-time retrieval over the full 140M-image corpus—significantly outperforming conventional image similarity search and establishing new state-of-the-art performance under zero-shot settings.

📝 Abstract
Finding scientifically interesting phenomena through slow, manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained multimodal astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search scalable to 140 million galaxy images, enabling discovery from previously infeasible searches. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at https://github.com/NolanKoblischke/AION-Search
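The retrieval step described above — embedding a free-text query and ranking galaxy images by similarity in a shared embedding space — can be sketched as follows. This is a toy illustration only: the dimensions, data, and function names are placeholders, not AION-Search's actual model or index, and it assumes the standard cosine-similarity search over L2-normalized embeddings that CLIP-style aligned models use.

```python
import numpy as np

# Toy stand-ins for the aligned embedding space: in the paper, image
# embeddings come from a multimodal astronomy foundation model
# contrastively aligned with VLM-generated captions; here they are random.
rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize along the last axis so dot product = cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# A tiny "index" of 1,000 images with 64-dim embeddings
# (the real corpus is 140M images).
image_embeddings = normalize(rng.standard_normal((1000, 64)))

def search(query_embedding, k=100):
    """Return indices of the top-k images by cosine similarity to the query."""
    sims = image_embeddings @ normalize(query_embedding)
    return np.argsort(sims)[::-1][:k]

# A query embedding would come from embedding text like "galaxy with tidal
# tails" through the aligned text encoder; mocked here as a random vector.
query = rng.standard_normal(64)
top100 = search(query, k=100)
```

Because ranking reduces to one matrix-vector product over precomputed, normalized embeddings, this design is what makes millisecond-scale retrieval over a very large corpus feasible (in practice with an approximate nearest-neighbor index rather than a dense matmul).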
Problem

Research questions and friction points this paper is trying to address.

Manual labeling campaigns are too slow to explore the billions of galaxy images produced by telescopes
Rare astronomical phenomena are hard to find without labeled training data
Large, unlabeled scientific image archives are not searchable by semantic content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging Vision-Language Models to generate galaxy image descriptions
Contrastively aligning a multimodal astronomy model with embedded descriptions
Introducing a VLM-based re-ranking method that nearly doubles top-100 recall for the most challenging rare targets
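The re-ranking idea above — fast embedding search to retrieve a candidate pool, then a slower but more accurate per-image judgment to reorder it — can be sketched like this. The scorer is mocked: in the paper it would be a vision-language model assessing each candidate image against the query, but the function name, scores, and ids here are hypothetical.

```python
# Hypothetical sketch of two-stage retrieval with VLM re-ranking.
def rerank(candidates, relevance_score, k=100):
    """Reorder candidate image ids by a per-image relevance score,
    keeping the top k. relevance_score(image_id) -> float, e.g. a
    VLM's judged probability that the image matches the query."""
    return sorted(candidates, key=relevance_score, reverse=True)[:k]

# Mock scorer standing in for a per-image VLM call such as
# "Does this galaxy image show a tidal tail?"
mock_scores = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7}
ranked = rerank([0, 1, 2, 3], mock_scores.get, k=3)
# ranked == [1, 3, 2]
```

The design trade-off is cost: the VLM is far too slow to score all 140M images, but applying it only to the top candidates from the embedding search recovers much of its accuracy at a fraction of the cost.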
Nolan Koblischke
University of Toronto
Liam Parker
UC Berkeley / Polymathic AI
Cosmology · Astrophysics · Machine Learning
Francois Lanusse
Université Paris-Saclay, Université Paris Cité, CEA, CNRS, AIM
Irina Espejo Morales
New York University
Jo Bovy
University of Toronto
Shirley Ho
Flatiron Institute, Center for Computational Astrophysics
Cosmology · Astrophysics · Machine Learning · Statistics