AION-Search is a text-based search engine for galaxy images, trained on GPT-4.1-mini-generated descriptions and summaries of HSC and Legacy Survey galaxies.
🔭 Use AION-Search now!
Try the live web demo: AION-Search App
📚 Check out our results
More details at: Project Page
📦 Explore Our Datasets
Access all data products (embeddings and captions): HuggingFace Datasets
git clone https://github.com/NolanKoblischke/AION-Search.git
cd AION-Search
pip install -e .  # or: uv pip install -e .
This package requires an OpenAI API key for generating text embeddings.
- Get your API key at platform.openai.com
- Create a .env file in the project root:
OPENAI_API_KEY=<your-api-key>
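As a rough illustration of what the key is used for, the sketch below requests text embeddings from the OpenAI embeddings endpoint. It is not the package's own code. `embed_texts` is a hypothetical helper name, and `client` is assumed to be an `openai.OpenAI()` instance, which reads `OPENAI_API_KEY` from the environment. Whether the package loads the .env file for you is not stated here, so the key may need to be exported first.

```python
from typing import Any, List, Sequence


def embed_texts(client: Any, texts: Sequence[str],
                model: str = "text-embedding-3-large") -> List[List[float]]:
    """Return one embedding (a list of floats) per input string, in input order.

    `client` is assumed to expose the OpenAI SDK call
    `client.embeddings.create(model=..., input=...)`.
    """
    response = client.embeddings.create(model=model, input=list(texts))
    # Each returned item carries the index of the input it belongs to,
    # so sort on it rather than relying on response order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [list(item.embedding) for item in ordered]


# Hypothetical usage (needs the `openai` package, network access, and a key):
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   vectors = embed_texts(client, ["a face-on spiral galaxy"])
#   len(vectors[0])    # 3072 by default for text-embedding-3-large
```

Passing the client in as an argument keeps the helper free of network calls at import time and makes it easy to exercise with a stand-in object.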
from aionsearch import AIONSearchClipModel
# Load pretrained model from HuggingFace
model = AIONSearchClipModel.from_pretrained()
# Project AION image embeddings into shared space
aion_embedding = ...  # placeholder: replace with an image embedding from github.com/PolymathicAI/AION, shape (batch, 768)
projected_image = model.image_projector(aion_embedding) # (batch, 768) -> (batch, 1024)
# Project OpenAI text embeddings into shared space
text_embedding = ...  # placeholder: replace with a text embedding from text-embedding-3-large, shape (batch, 3072)
projected_text = model.text_projector(text_embedding) # (batch, 3072) -> (batch, 1024)
# Compute similarity for semantic search
similarity = projected_image @ projected_text.T
See examples/quick_start.ipynb for a complete walkthrough that downloads a galaxy image, generates embeddings with AION, and performs text-to-image similarity search.
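To make the similarity step concrete, here is a minimal NumPy sketch that ranks a set of projected image embeddings against one projected text query. It makes several assumptions. `rank_images` is a hypothetical helper, not part of `aionsearch`. The projected embeddings are assumed to have been converted to NumPy arrays already. The vectors are L2-normalised first so that the dot product is a cosine similarity, since it is not stated here whether the projectors already normalise their outputs.

```python
import numpy as np


def rank_images(projected_images, projected_text, top_k=5):
    """Return (indices, scores) of the top_k images most similar to one text query.

    projected_images: array of shape (num_images, dim)
    projected_text:   array of shape (dim,) or (1, dim)
    """
    images = np.asarray(projected_images, dtype=np.float64)
    query = np.asarray(projected_text, dtype=np.float64).reshape(-1)

    # L2-normalise so the dot product below equals cosine similarity.
    # Zero-length vectors are not handled in this sketch.
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)

    scores = images @ query              # shape: (num_images,)
    order = np.argsort(-scores)[:top_k]  # best match first
    return order, scores[order]
```

For example, with toy 2-D vectors `[[1, 0], [0, 1], [1, 1]]` and the query `[2, 0]`, the first image ranks highest and the third comes second.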
The research/ directory contains the paper-facing experiment code and small frozen artifacts needed to reproduce the main analyses where redistribution is practical. It includes:
- retrieval experiments for spirals, mergers, and gravitational lenses in research/src/experiments/retrieval/
- the controlled HSC lens re-ranking experiment in research/src/experiments/hsc_lens_rerank/
- Galaxy10 zero-shot and MLP-probe experiments in research/src/experiments/gz10/
- stream-candidate catalog comparison in research/src/experiments/stream_finding/
- selected cached tables, labels, catalogs, and README assets under research/data/ and research/assets/
The research environment can be installed from research/pyproject.toml:
cd research
uv sync
Some large inputs and external services are intentionally not vendored into the Git repository. The experiment READMEs document which public Hugging Face artifacts, raw survey products, or API credentials are required for each workflow.
If you find this work useful, please cite:
@misc{koblischke2025semantic,
title={Semantic Search for 100M+ Galaxy Images Using AI-Generated Captions},
author={Nolan Koblischke and Liam Parker and Francois Lanusse and Jo Bovy and Irina Espejo and Shirley Ho},
year={2025},
eprint={2512.11982},
archivePrefix={arXiv},
primaryClass={astro-ph.IM},
url={https://arxiv.org/abs/2512.11982},
}